Abstract. We develop an abstract model of information acquisition
from redundant data. We assume a random sampling process from data
which provide information with bias and are interested in the fraction
of information we expect to learn as a function of (i) the sampled fraction
(recall) and (ii) varying bias of information (redundancy distributions).
We develop two rules of thumb with varying robustness. We first show
that, when information bias follows a Zipf distribution, the 80-20 rule
or Pareto principle surprisingly does not hold, and we rather expect to
learn less than 40% of the information when randomly sampling 20%
of the overall data. We then analytically prove that for large data sets,
randomized sampling from power-law distributions leads to "truncated
distributions" with the same power-law exponent. This second rule is
very robust and also holds for distributions that deviate substantially
from a strict power law. We further give one particular family of power-law
functions that remain completely invariant under sampling. Finally,
we validate our model with two large Web data sets: link distributions
to domains and tag distributions on delicious.com.



1 Introduction 

The 80-20 rule (also known as Pareto principle) states that, often in life, 20% of
effort can roughly achieve 80% of the desired effects. An interesting question is
whether this rule also holds in the context of information acquisition from
redundant data. Intuitively, we know that we can find more information on a 
given topic by gathering a larger number of data points. However, we also know 
that the marginal benefit of knowing additional data decreases with the size of 
the corpus. Does the 80-20 rule hold for information acquisition from redundant 
data? Can we learn 80% of URLs on the Web by parsing only 20% of the web 
pages? Can we learn 80% of the used vocabulary by looking at only 20% of the 
tags? Can we learn 80% of the news by reading 20% of the newspapers? More 
generally, can we learn 80% of all available information in a corpus by randomly
sampling 20% of data without replacement? 

In this paper, we show that when assuming a Zipf redundancy distribution,
the Pareto principle does not hold. Instead, we rather expect to see less than
40% of the available information. To show this in a principled, yet abstract
fashion, we develop an analytic sampling model of information acquisition from
redundant data. We assume the dissemination of relevant information is biased,
i.e. different pieces of information are more or less frequently represented in
available sources. We refer to this bias as redundancy distribution in accordance
with work on redundancy in information extraction [11]. Information acquisition,
in turn, can be conceptually broken down into the subsequent steps of IR, IE, and
II, i.e. visiting a fraction $r$ of the available sources, extracting the information,
and combining it into a unified view (see Fig. 1). Our model relies on only
three simple abstractions: (1) we consider a purely randomized sampling process
without replacement; (2) we do not model disambiguation of the data, which is a
major topic in information extraction, but not our focus; and (3) we consider the
process in the limit of infinitely large data sets. With these three assumptions,
we estimate the success of information acquisition as a function of (i) the recall of
the retrieval process and (ii) the bias in redundancy of the underlying data.

Fig. 1: Processes of information dissemination and information acquisition.
We want to predict the fraction of information we can learn ($\hat{r}_u$) as a
function of recall ($r$) and the bias in the data (redundancy distribution $\boldsymbol{\rho}$).

Main contributions. We develop an analytic model for the information
acquisition from redundant data and (1) derive the 40-20 rule, a modification of
the Pareto principle which has not been stated before. (2) While power laws do
not remain invariant under sampling in general [26], we prove that one particular
power law family does remain invariant. (3) While other power laws do not
remain invariant in their overall shape, we further prove that the "core" of such
a frequency distribution does remain invariant; this observation allows us to
develop a second rule of thumb. (4) We validate our predictions by randomly
sampling from two very large real-world data sets with power-law behavior.

This is the full version of a conference paper [TB] (pages 1-14). All proofs 
and further details are contained in the appendix. 



2 Basic notions used throughout this paper 

We use the term redundancy as a synonym for frequency or multiplicity. We do so
to remain consistent with the term commonly used in web information extraction,
referring to the redundant nature of information on the Web. The notions
of data and information are defined in various and partly contradicting ways
in the information retrieval, information extraction, database and data integration
literature. In general, their difference is attributed to novelty, relevance,
organization, available context or interpretation. The most commonly found
understanding is that of data as representation of information which can become
information when it is interpreted as new and relevant in a given context [H] . In
this work, we follow this understanding of data as "raw" information and use
the term data for the partly redundant representation of information.

Fig. 2: (a): Redundancy distribution $\boldsymbol{\rho}$. (b): Representation as redundancy
frequency distribution $\boldsymbol{\alpha}$. (c): Recall $r = 3/15$ and unique recall $r_u = 2/5$.

Let $a$ be the total number of data items and $a_u$ the number of unique pieces
of information among them. Average redundancy $\rho$ is simply their ratio $\rho = a/a_u$.
Let $\rho_i$ refer to the redundancy of the $i$-th most frequent piece of information.
The redundancy distribution $\boldsymbol{\rho}$ (also known as rank-frequency distribution) is the
vector $\boldsymbol{\rho} = (\rho_1, \ldots, \rho_{a_u})$. Figure 2a provides the intuition with a simple
balls-and-urn model: Here, each color represents a piece of information and each ball
represents a data item. As there are 3 red balls, redundancy of the information
"color = red" is 3. Next, let $\alpha_k$ be the fraction of information with redundancy
equal to $k$, $k \in [k_{\max}]$. A redundancy frequency distribution (also known as
count-frequency plot) is the vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_{k_{\max}})$. It allows us to describe
redundancy without regard to the overall number of data items $a$ (see Fig. 2b)
and, as we see later, an analytic treatment of sampling for the limit of infinitely
large data sets. We further use the term redundancy layer (also known as
complementary cumulative frequency distribution or ccfd) $\eta_k$ to describe the fraction
of information that appears with redundancy $\ge k$: $\eta_k = \sum_{i=k}^{k_{\max}} \alpha_i$. For example,
in Fig. 2a the fraction of information with redundancy at least 3 is
$\eta_3 = \alpha_3 + \alpha_6 = \frac{2}{5} + \frac{1}{5} = \frac{3}{5}$. Finally, recall is the well known measure for the
coverage of a data gathering or selection process. Let $b$ be a retrieved subset of
the $a$ total data items. Recall is then $r = b/a$.

We define unique recall as its counterpart for unique data items. Thus, it
measures the coverage of information. Let $b_u$ be the number of unique pieces of
information among $b$, and $a_u$ the number of unique pieces of information among
$a$. Unique recall is then $r_u = b_u/a_u$. We illustrate again with the urns model:
assume that we randomly gather 3 from the 15 total balls (recall $r = 3/15$) and
that, thereby, we learn 2 colors out of the 5 total available colors (Fig. 2c).
Unique recall is thus $r_u = 2/5$ and the redundancy distribution of the sample is
$\boldsymbol{\rho} = (2, 1)$.

Fig. 3: (a, b): Random sampling from an urn filled with $a$ balls in $a_u$ different
colors. Each color appears on exactly $\rho = a/a_u$ balls. (c): Normalized sample
distribution in grey with unique recall $\hat{r}_u \approx 0.8$ for $r = 0.5$ from $\boldsymbol{\alpha}$ in Fig. 2b.
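
The notions of this section can be made concrete with a short sketch. The urn contents below are an assumption chosen to match the example (15 balls, 5 colors, three red balls); the other color names, the particular sample, and all function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def redundancy_distribution(data):
    """Rank-frequency vector rho: redundancies sorted in decreasing order."""
    return sorted(Counter(data).values(), reverse=True)

def redundancy_frequencies(rho):
    """alpha_k: fraction of unique pieces of information with redundancy exactly k."""
    a_u = len(rho)
    counts = Counter(rho)
    return [Fraction(counts.get(k, 0), a_u) for k in range(1, max(rho) + 1)]

def redundancy_layers(alpha):
    """eta_k: fraction of information with redundancy at least k (suffix sums of alpha)."""
    eta, running = [], Fraction(0)
    for a_k in reversed(alpha):
        running += a_k
        eta.append(running)
    return eta[::-1]

# Assumed urn: 15 balls in 5 colors with redundancies 6, 3, 3, 2, 1.
urn = ["blue"] * 6 + ["red"] * 3 + ["green"] * 3 + ["yellow"] * 2 + ["black"]
sample = ["blue", "blue", "red"]          # one possible sample of 3 balls

rho = redundancy_distribution(urn)        # rank-frequency vector of the urn
alpha = redundancy_frequencies(rho)       # alpha_1 .. alpha_6
eta = redundancy_layers(alpha)            # eta_1 .. eta_6
recall = Fraction(len(sample), len(urn))              # r = b / a
unique_recall = Fraction(len(set(sample)), len(rho))  # r_u = b_u / a_u

print("rho   =", rho)
print("alpha =", [str(x) for x in alpha])
print("eta_3 =", eta[2])
print("r =", recall, " r_u =", unique_recall,
      " sample rho =", redundancy_distribution(sample))
```

Exact fractions are used so that the values can be compared directly with the fractions in the text.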



3 Unique recall 



We next give an analytic description of sampling without replacement as a function
of recall and the bias of available information in the limit of very large data sets.

Proposition 1 (Unique recall $\hat{r}_u$). Assume randomized sampling without
replacement with recall $r \in [0, 1]$ from a data set with redundancy frequency
distribution $\boldsymbol{\alpha}$. The expected value of unique recall for large data sets is asymptotically
concentrated around

$\hat{r}_u = 1 - \sum_{k=1}^{k_{\max}} \alpha_k (1 - r)^k$.



The proof applies Stirling's formula and a number of analytic transformations
to a combinatorial formulation of a balls-and-urn model. The important
consequence of Prop. 1 is now that unique recall can be investigated without
knowledge of the actual number of data items $a$, but by just analyzing the
normalized redundancy distributions. Hence, we can draw general conclusions for
families of redundancy distributions assuming very large data sets. To simplify
the presentation and to remind us of this limit consideration, we will use the hat
symbol and write $\hat{r}_u$ for $\lim_{a \to \infty} E(r_u) \approx E(r_u)$.

Figure 3 illustrates this limit value with two examples. First, assume an urn
filled with $a$ balls in $a_u$ different colors. Each color appears on exactly two balls,
hence $\rho = 2$ and $a = 2a_u$. Then the expected value of unique recall $r_u$ (fraction
of colors sampled) is converging towards $1 - (1 - r)^2$ and its variance towards 0
for increasing numbers of balls $a$ (Fig. 3a, Fig. 3b). For example, keeping $\rho = 2$
and $r = 0.5$ fixed, and varying only $a = 4, 6, 8, 10, \ldots$, then unique recall varies
as $r_u = 0.83, 0.80, 0.79, 0.78, \ldots$, and converges towards $\hat{r}_u = 0.75$. At $a = 1000$,
$r_u$ is already $0.7503 \pm 0.02$ with 90% confidence. Second, assume that we sample
50% of balls from the distribution $\boldsymbol{\alpha} = (\frac{1}{5}, \frac{1}{5}, \frac{2}{5}, 0, 0, \frac{1}{5})$ of Fig. 2b. Then we can
expect to learn ≈80% of the colors if $a$ is very large (Fig. 3c). In contrast, exact
calculations show that if $a = 15$ as in Fig. 2a, then the actual expectation is
around 79% or 84% for sampling 7 or 8 balls, respectively. Thus, Prop. 1
calculates the exact asymptotic value only for the limit, but already gives very
good approximations for large data sets.
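
The numbers in the two examples can be checked with a minimal sketch. It compares the limit value of Prop. 1 with an exact finite-size expectation. The exact expression is an assumption of this sketch, not the paper's proof: it uses the standard hypergeometric probability C(a-k, b)/C(a, b) that none of the k copies of a piece is drawn. The distribution `alpha_fig2` is the reconstruction of Fig. 2b used above, and all names are hypothetical.

```python
from math import comb

def unique_recall_limit(alpha, r):
    """Prop. 1: 1 - sum_k alpha_k (1 - r)^k, where alpha[0] holds alpha_1."""
    return 1.0 - sum(a_k * (1.0 - r) ** k for k, a_k in enumerate(alpha, start=1))

def unique_recall_exact(alpha, a, b):
    """Exact expected unique recall when b of a items are drawn without replacement.

    A piece with redundancy k is missed entirely with probability
    C(a - k, b) / C(a, b) (no copy of it is among the b drawn items).
    """
    total = comb(a, b)
    missed = sum(a_k * (comb(a - k, b) / total)
                 for k, a_k in enumerate(alpha, start=1))
    return 1.0 - missed

# First example: every color appears on exactly two balls, half of the balls drawn.
two_per_color = [0.0, 1.0]
for a in (4, 6, 8, 10, 1000):
    print(a, round(unique_recall_exact(two_per_color, a, a // 2), 4))
print("limit:", unique_recall_limit(two_per_color, 0.5))

# Second example: reconstructed distribution of Fig. 2b (assumption).
alpha_fig2 = [0.2, 0.2, 0.4, 0.0, 0.0, 0.2]
print("limit at r = 0.5:", unique_recall_limit(alpha_fig2, 0.5))
print("a = 15, b = 7:", unique_recall_exact(alpha_fig2, 15, 7))
print("a = 15, b = 8:", unique_recall_exact(alpha_fig2, 15, 8))
```

The finite-size values approach the limit from above in the first example, which is the convergence described in the text.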



4 Unique recall for power law redundancy distributions 



Due to their well-known ubiquity, we will next study power law redundancy
distributions. We distinguish three alternative definitions: (1) power laws in the
redundancy distributions, (2) in the redundancy frequencies, and (3) in the
redundancy layers. These three power laws are commonly considered to be different
expressions of the identical distribution [3,21] because they have the same tail
distribution¹, and they are in fact identical in a continuous regime. However, for
discrete values, these three definitions of power laws actually produce different
distributions and have different unique recall functions. We will show this next.

Power laws in the redundancy distribution $\boldsymbol{\rho}$. This distribution arises
when the frequency or redundancy $\rho$ of an item is proportional to a power law
with exponent $\delta$ of its rank $i$: $\rho(i) \propto i^{-\delta}$, $i \in [a_u]$. Two often cited examples of
such power law redundancy distributions where $\delta \approx 1$ are the frequency-rank
distribution of words appearing in arbitrary corpora and the size distribution
of the biggest cities for most countries. These are called "Zipf distributions"
after Zipf. Using Prop. 1 we can derive in a few steps

$\hat{r}_{u\rho}(r, \delta) = 1 - \sum_{k=1}^{\infty} \left( (2k-1)^{-1/\delta} - (2k+1)^{-1/\delta} \right) (1-r)^k$.

For the particularly interesting case of $\delta = 1$, this infinite sum can be reduced to

$\hat{r}_{u\rho}(r, \delta = 1) = \frac{r}{\sqrt{1-r}} \operatorname{artanh}\left(\sqrt{1-r}\right)$.

Power laws in the redundancy frequency distribution $\boldsymbol{\alpha}$. This distribution
arises when a fraction of information $\alpha_k$ that appears exactly $k$ times
follows a power law $\alpha_k = C \cdot k^{-\beta}$, $k \in \mathbb{N}_1$. Again, using Prop. 1 we can
derive in a few steps $\hat{r}_{u\alpha}(r, \beta) = 1 - \frac{\mathrm{Li}_\beta(1-r)}{\zeta(\beta)}$, where $\mathrm{Li}_\beta(x)$ is the polylogarithm
$\mathrm{Li}_\beta(x) = \sum_{k=1}^{\infty} k^{-\beta} x^k$, and $\zeta(\beta)$ the Riemann zeta function $\zeta(\beta) = \sum_{k=1}^{\infty} k^{-\beta}$.

Power laws in the redundancy layers $\boldsymbol{\eta}$. This distribution arises when
the redundancy layers $\eta_k \in [0, 1]$ follow a power law $\eta_k \propto k^{-\gamma}$. From $\eta_1 = 1$ we
get $\eta_k = k^{-\gamma}$ and, hence, $\alpha_k = k^{-\gamma} - (k+1)^{-\gamma}$. Using again Prop. 1 we get in
a few steps $\hat{r}_{u\eta}(r, \gamma) = \frac{r}{1-r} \mathrm{Li}_\gamma(1-r)$. For the special case of $\gamma = 1$, we can use
the property $\mathrm{Li}_1(x) = -\ln(1-x)$ and simplify to $\hat{r}_{u\eta}(r, \gamma = 1) = -\frac{r \ln r}{1-r}$.

Comparing unique recall for power laws. All three power laws show
the typical power law tail in the loglog plot of the redundancy distribution
(loglog rank-frequency plot), and it is easily shown that the exponents can be
calculated from each other according to Fig. 4e. However, the distributions are
actually different at the power law root (Fig. 4a) and also lead to different unique
recall functions. Figure 4b shows their different unique recall functions for the
particular power law exponent of $\gamma = 1$ ($\beta = 2$, $\delta = 1$), which is assumed to be
the most common redundancy distribution of words in a large corpus [27] and
many other frequency distributions [21,22]. Given our normalized framework, we
can now ask the interesting question: Does the 80-20 rule hold for information
acquisition assuming a Zipf distribution? Put differently, if we sample 20% of
the total amount of data (e.g. read 20% of a text corpus, or look at 20% of all
existing tags on the Web), what percentage of the contained information (e.g.
fraction of different words in a corpus or the tagging data) can we expect to
learn if redundancy follows a Zipf distribution? Figure 4c lists the results for
the three power law distributions and shows that, depending on which of the
three definitions we choose, we can only expect to learn between 32% and 40% of
the information. Note that we can apply this rule of thumb without knowing the
total amount of available information. Also note that these numbers are sensitive
to the power law root and, hence, to deviations from an ideal power law. This
is also why unique recall diverges for our 3 variations of power law definitions
in the first place (Fig. 4a). Finally, Fig. 4d shows that the power law exponent
would have to be considerably different from $\gamma = 1$ to give an 80-20 rule.

¹ With tail of a distribution, we refer to the part of a redundancy distribution for
$\eta \to 0$, with root to $\eta \to 1$, and with core to the interval in between (see Fig. 4a).

Fig. 4: The three power law redundancy distributions (e) have the same
power law tail and power law core but different power law roots in the loglog
redundancy plot (a). This leads to different unique recall functions (b&d),
and different fractions of information learned after sampling 20% data (c).

Rule of thumb 1 (40-20 rule). When randomly sampling 20% of data whose 
redundancy distribution follows an exact Zipf distribution, we can expect to learn 
less than 40% of the contained information. 
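
The range "between 32% and 40%" can be reproduced numerically. The following sketch evaluates Prop. 1 as a truncated series for the three power law definitions in the Zipf case ($\delta = 1$, $\beta = 2$, $\gamma = 1$) at $r = 0.2$ and compares two of them with the closed forms given above. The truncation at 2000 terms and all names are assumptions of this sketch.

```python
from math import atanh, log, pi, sqrt

def recall_from_alpha(alpha_of_k, r, terms=2000):
    """Prop. 1 on a truncated series; alpha_of_k maps k to alpha_k."""
    return 1.0 - sum(alpha_of_k(k) * (1.0 - r) ** k for k in range(1, terms + 1))

r = 0.2  # sample 20% of the data

# Redundancy frequencies alpha_k implied by each definition in the Zipf case.
series = {
    # power law in the redundancy distribution rho (delta = 1)
    "rho": recall_from_alpha(lambda k: 1 / (2 * k - 1) - 1 / (2 * k + 1), r),
    # power law in the redundancy frequencies alpha (beta = 2, C = 1 / zeta(2))
    "alpha": recall_from_alpha(lambda k: (6 / pi ** 2) / k ** 2, r),
    # power law in the redundancy layers eta (gamma = 1)
    "eta": recall_from_alpha(lambda k: 1 / k - 1 / (k + 1), r),
}

# Closed forms stated in the text for delta = 1 and gamma = 1.
closed = {
    "rho": r / sqrt(1 - r) * atanh(sqrt(1 - r)),
    "eta": -r * log(r) / (1 - r),
}

for name, value in series.items():
    print(f"{name:5s} unique recall at r = 0.2: {value:.3f}")
```

All three values stay far below 0.8, which is the content of rule of thumb 1; the spread between them comes only from the different power law roots.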



5 K-recall and the evolution of redundancy distributions 

So far, we were interested in the expected fraction of information we learn
when we randomly sample a fraction $r$ of the total data. We now generalize the
question and derive an analytic description of the overall shape of the expected
sample redundancy distribution (Fig. 3c). As it turns out, and as will become
clear in this and the following section, the natural way to study and solve this
question is again to analyze the horizontal "evolution" of the redundancy layers $\boldsymbol{\eta}$
during sampling. To generalize unique recall $r_u$, we define k-recall as the fraction
$r_{uk}$ of information that has redundancy $\ge k$ and also appears at least $k$ times in
our sample. More formally, let $a_{uk}$ be the number of unique pieces of information
with redundancy $\ge k$ in a data set, and let $b_{uk}$ be the number of unique pieces of
information with redundancy $\ge k$ in a sample. K-recall $r_{uk}$ is then the fraction
of $a_{uk}$ that has been sampled: $r_{uk} = b_{uk}/a_{uk}$. The special case $r_{u1}$ is then simply
the so far discussed unique recall $r_u$. We assume large data sets throughout this
and all following sections without always explicitly using the hat notation $\hat{r}_{uk}$.

K-recall has its special relevance when sampling from partly unreliable data.
In such circumstances, the general fall-back option is to assume a piece of
information to be true when it is independently learned from at least $k$ different
sources. This approach is used in statistical polling, in many artificial intelligence
applications of learning from unreliable information, and in consensus-driven
decision systems: Counting the number of times a piece of information occurs
(its support) is used as a strong indicator for its truth. As such, to believe a piece of
information only when it appears at least $k$ times in a random sample serves as
starting point from which more complicated polling schemes can be conceived.
In this context, $r_{uk}$ gives the ratio of information that we learn and consider true
(it appears $\ge k$ times in our sample) to the overall information that we would
consider true if known to us (it appears $\ge k$ times in the data set) (Fig. 5a).
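
For a concrete data set and one concrete sample, k-recall is a simple count. The sketch below is an illustration only; the urn contents, the sample, and the function name are assumptions that match the small example of Section 2.

```python
from collections import Counter

def empirical_k_recall(data, sample, k):
    """r_uk = b_uk / a_uk for one concrete sample drawn from data.

    a_uk: pieces of information with redundancy >= k in the data set.
    b_uk: pieces of information that appear >= k times in the sample.
    """
    data_counts, sample_counts = Counter(data), Counter(sample)
    a_uk = sum(1 for count in data_counts.values() if count >= k)
    b_uk = sum(1 for count in sample_counts.values() if count >= k)
    return b_uk / a_uk if a_uk else float("nan")

# Assumed urn (15 balls, 5 colors) and one possible sample of 3 balls.
urn = ["blue"] * 6 + ["red"] * 3 + ["green"] * 3 + ["yellow"] * 2 + ["black"]
sample = ["blue", "blue", "red"]

for k in (1, 2, 3):
    print(f"k = {k}: r_uk = {empirical_k_recall(urn, sample, k):.2f}")

# Pieces we would "believe" under a support threshold of k = 2.
believed = {color for color, count in Counter(sample).items() if count >= 2}
print("believed at k = 2:", believed)
```

With $k = 1$ the function reduces to unique recall, as stated in the text.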

We also introduce a variable $\omega_k$ for the fraction of total information we get
in our sample that appears at least $k$ times instead of just once. Note that
$\omega_k = \eta_k r_{uk} = b_{uk}/a_u$. All $\omega_k$ with $k \in [k_{\max}]$ together form the vector $\boldsymbol{\omega}$
representing the sample redundancy layers in a random sample with $r \in [0, 1]$. As $r$
increases from 0 to 1, it "evolves" from the $k_{\max}$-dimensional null vector to
the redundancy layers $\boldsymbol{\eta}$ of the original redundancy distribution. Because of this
intuitive interpretation, we call evolution of redundancy the transformation of a
redundancy frequency distribution given by the redundancy layers $\boldsymbol{\eta}$ to the
expected distribution $\boldsymbol{\omega}$ as a function of $r$: $\boldsymbol{\eta} \xrightarrow{r} \boldsymbol{\omega}$, $r \in [0, 1]$. We further use $\Delta_k$ to
describe the fraction of information with redundancy exactly $k$ in the sample: $\Delta_k = \omega_k - \omega_{k+1}$.
To define this equation for all $k \in \mathbb{N}_0$, we make the convention $\omega_0 = 1$ and $\omega_k = 0$
for $k > k_{\max}$. We can then derive the following analytic description:

Proposition 2 (Sample distribution $\boldsymbol{\omega}$). The asymptotic expectation of the
fraction of information $\hat{\omega}_k$ that appears with redundancy $\ge k$ in a randomly
sampled fraction $r$ without replacement from a data set with redundancy distribution
$\boldsymbol{\alpha}$ is, for $a \to \infty$ and with $\alpha_0 = 0$,

$\hat{\omega}_k = 1 - \sum_{y=0}^{k-1} \sum_{x=y}^{k_{\max}} \alpha_x \binom{x}{y} r^y (1-r)^{x-y}$.

The first part of the proof constructs a geometric model of sampling from
infinitely large data sets with homogeneous redundancy and derives the binomial
distribution as evolution of the redundancy layers. The second part then applies 
this result to stratified sampling from arbitrary redundancy distributions. 
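
Prop. 2 can be evaluated directly for a small distribution. The sketch below computes the sample layers $\omega_k$, the sample frequencies $\Delta_k$, and the k-recalls $r_{uk} = \omega_k / \eta_k$. The input is the reconstructed distribution of Fig. 2b, which is an assumption, as are all names.

```python
from math import comb

def sample_layers(alpha, r):
    """Prop. 2: omega_k for k = 1..k_max, where alpha[0] holds alpha_1.

    omega_k = 1 - sum_{y<k} sum_x alpha_x * C(x, y) * r^y * (1 - r)^(x - y)
    """
    omega = []
    for k in range(1, len(alpha) + 1):
        fewer_than_k = 0.0
        for x, a_x in enumerate(alpha, start=1):
            # a piece with redundancy x shows up y times with binomial probability
            for y in range(min(k, x + 1)):
                fewer_than_k += a_x * comb(x, y) * r ** y * (1.0 - r) ** (x - y)
        omega.append(1.0 - fewer_than_k)
    return omega

alpha = [0.2, 0.2, 0.4, 0.0, 0.0, 0.2]        # reconstructed Fig. 2b (assumption)
eta = [sum(alpha[k - 1:]) for k in range(1, len(alpha) + 1)]

omega = sample_layers(alpha, 0.5)
delta = [w - w_next for w, w_next in zip(omega, omega[1:] + [0.0])]
k_recall = [w / e for w, e in zip(omega, eta)]

for k, (w, d, rk) in enumerate(zip(omega, delta, k_recall), start=1):
    print(f"k = {k}: omega = {w:.4f}  Delta = {d:.4f}  r_uk = {rk:.4f}")
```

For $k = 1$ the double sum collapses to the single sum of Prop. 1, and for $r = 1$ the function returns the original layers $\boldsymbol{\eta}$, which matches the "evolution" picture above.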
Fig. 5: Given a redundancy frequency distribution $\boldsymbol{\alpha}$ and recall $r$, k-recall
$r_{uk}$ describes the fraction of information appearing $\ge k$ times that also
appears $\ge k$ times in our sample: $r_{uk} = b_{uk}/a_{uk}$ (a). Sampling from completely
developed power laws leads to sample distributions with the same power law
tail, and $r_{uk} \approx r^{\gamma}$ holds independent of $k$ for $k > 10$ (b). Truncated power
laws are cut off at some maximum value $k_{\max}$ (c). As a consequence, the
tails of the sample distributions "break in" for increasingly lower recalls
(d). However, the invariant power law core with $r_{uk} \approx r^{\gamma}$ is still visible.



6 The Evolution of power laws 

Given the complexity of Prop. 2, it seems at first sight that we have not achieved
much. As it turns out, however, this equation hides a beautiful simplicity for 
power laws: namely, their overall shape remains "almost" invariant during sam- 
pling. We will first formalize this notion, then prove it, and finally use it for 
another, very robust rule of thumb. 

We say a redundancy distribution $\boldsymbol{\alpha}$ is invariant under sampling if,
independent of $r$, the expected normalized sample distribution $\boldsymbol{\Delta}/\omega_1$ is the same as the
original distribution: $\Delta_k/\omega_1 = \alpha_k$. Hence, for an invariant distribution it holds that
$\omega_k = \sum_{x=k}^{\infty} \Delta_x = \omega_1 \sum_{x=k}^{\infty} \alpha_x = \omega_1 \eta_k$, and, hence, $r_{uk}$ is independent of $k$:
$r_{uk} = \omega_1$. With this background, we can state the following lemma:

Lemma 1 (Invariant family). The following family of redundancy distributions
is invariant under sampling: $\alpha_k = (-1)^{k-1} \binom{\tau}{k}$, with $0 < \tau < 1$.



The proof of Lemma 1 succeeds by applying Prop. 2 to the invariant family
and deriving $r_{uk} = r^{\tau}$ after application of several binomial identities. Note
that the invariant family has a power law tail. We see that by calculating its
asymptotic behavior with the help of the asymptotic of the binomial coefficient
$\binom{\tau}{k} = O\left(\frac{1}{k^{\tau+1}}\right)$, as $k \to \infty$, for $\tau \notin \mathbb{N}$. Therefore, we also have $\alpha_k = O\left(k^{-(\tau+1)}\right)$
for $k \to \infty$. Comparing this equation with the power law in the redundancy
frequency plot, $\alpha_k \propto k^{-\beta}$, we get the power-law equivalent exponent as $\beta = \tau + 1$,
with $1 < \beta < 2$. Also note that the invariant family is not "reasonable" according
to the definition of [2], since the mean redundancy $\sum_k k\,\alpha_k$ is not finite.
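
Lemma 1 can be checked numerically. The sketch below builds the invariant family with a recurrence that follows from the definition of the generalized binomial coefficient ($\alpha_1 = \tau$, $\alpha_{k+1} = \alpha_k (k - \tau)/(k+1)$), evaluates $r_{uk}$ with Prop. 2, and compares it with $r^{\tau}$. Truncating the infinite family at 2000 terms, the choice $\tau = 0.5$, and all names are assumptions of this sketch.

```python
from math import comb

def invariant_family(tau, k_max):
    """alpha_k = (-1)^(k-1) * binom(tau, k) for k = 1..k_max.

    Uses alpha_1 = tau and alpha_{k+1} = alpha_k * (k - tau) / (k + 1),
    which avoids alternating signs and large intermediate values.
    """
    alpha, a_k = [], tau
    for k in range(1, k_max + 1):
        alpha.append(a_k)
        a_k *= (k - tau) / (k + 1)
    return alpha

def k_recall(alpha, r, k):
    """r_uk = omega_k / eta_k, with omega_k from Prop. 2 on the truncated family."""
    eta_k = 1.0 - sum(alpha[: k - 1])             # eta_k = 1 - sum_{i<k} alpha_i
    fewer_than_k = sum(
        a_x * comb(x, y) * r ** y * (1.0 - r) ** (x - y)
        for x, a_x in enumerate(alpha, start=1)
        for y in range(min(k, x + 1))
    )
    return (1.0 - fewer_than_k) / eta_k

tau, r = 0.5, 0.5
alpha = invariant_family(tau, 2000)

for k in (1, 2, 3):
    print(f"k = {k}: r_uk = {k_recall(alpha, r, k):.6f}   r^tau = {r ** tau:.6f}")

# Tail exponent: doubling k should scale alpha_k by about 2^-(tau + 1).
print("alpha_2000 / alpha_1000 =", alpha[1999] / alpha[999], " vs ", 2 ** -(tau + 1))
```

The omitted terms beyond the truncation are multiplied by $(1-r)^x$ with $x > 2000$ and therefore do not affect the result at this precision.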

We next analyze sampling from completely developed power laws, i.e.
distributions that have infinite layers of redundancy ($k_{\max} \to \infty$). Clearly, those cannot
exist in real discrete data sets, but their formal treatment allows us to also
consider sampling from truncated power laws. The latter are real-world power law
distributions which are truncated at $k_{\max} \in \mathbb{N}$ (Fig. 5c). We prove that the power
law core remains invariant for truncated power laws, and they, hence, appear as
"almost" invariant over a large range, i.e. except for their tail and their root.

Lemma 2 (Completely developed power laws). Randomized sampling without
replacement from redundancy distributions with completely developed power
law tails $\boldsymbol{\alpha}_C$ leads to sample distributions with the same power law tails.

Theorem 1 (Truncated power law distributions). Randomized sampling
without replacement from redundancy distributions with truncated power law tails
$\boldsymbol{\alpha}_T$ leads to distributions with the same power law core but further truncated
power law tails.



The proof for Lemma 2 succeeds in a number of steps by showing that
$\lim_{k \to \infty} [\ldots] = 1$ for distributions with $\boldsymbol{\alpha}_C$. The proof of Theorem 1 builds
upon this lemma and shows that $\lim_{k \ll k_{\max}} \Delta_k(\boldsymbol{\alpha}_T, r) = \Delta_k(\boldsymbol{\alpha}_C, r)$. In other
words, Theorem 1 states that sampling from real-world power law distributions
leads to distributions with the same power law core but possibly different tail
and root. More formally, $r_{uk} \approx r^{\gamma}$ for $k_1 < k < k_2$, where $k_1$ and $k_2$ depend
on the actual distribution, maximum redundancy and the power law exponent.
Both tail and root are usually ignored when judging whether a distribution
follows a power law (cf. Figure 3 in [5]), and to the best of our knowledge, this
result is new. Furthermore, it is only recently that Stumpf et al. [26] have shown
that sampling from power laws does not lead to power laws in the sample, in
general. Our result clarifies this finding and shows that only their tails and roots
are subject to change. Figure 5b, Fig. 5c, and Fig. 5d illustrate our result.

Rule of thumb 2 (Power law cores). When randomly sampling from a power
law redundancy distribution, we can expect the sample distribution to be a power
law with the same power law exponent in the core: $r_{uk} \approx r^{\gamma}$ for $k_1 < k < k_2$.
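
The second rule of thumb can be illustrated with a truncated power law in the redundancy layers, $\eta_x = x^{-\gamma}$ for $x \le k_{\max}$. The sketch below evaluates $r_{uk}$ in the root, the core, and the tail. It uses $\omega_k = \sum_x \alpha_x P(\mathrm{Bin}(x, r) \ge k)$, which is Prop. 2 rewritten with $\sum_x \alpha_x = 1$. The parameters $\gamma = 1$, $k_{\max} = 400$, $r = 0.5$, the chosen values of $k$, and all names are assumptions of this sketch.

```python
from math import exp, lgamma, log

def binom_pmf(x, y, r):
    """P(Bin(x, r) = y) for 0 < r < 1, computed in log space to avoid overflow."""
    log_p = (lgamma(x + 1) - lgamma(y + 1) - lgamma(x - y + 1)
             + y * log(r) + (x - y) * log(1.0 - r))
    return exp(log_p)

def k_recall_truncated_power_law(gamma, k_max, r, k):
    """r_uk for layers eta_x = x^(-gamma) that are cut off at k_max."""
    def eta(x):
        return x ** (-gamma) if x <= k_max else 0.0

    omega_k = 0.0
    for x in range(k, k_max + 1):
        alpha_x = eta(x) - eta(x + 1)
        # probability that a piece with redundancy x is sampled at least k times
        at_least_k = 1.0 - sum(binom_pmf(x, y, r) for y in range(k))
        omega_k += alpha_x * at_least_k
    return omega_k / eta(k)

gamma, k_max, r = 1.0, 400, 0.5
for k in (1, 20, 50, 300):
    value = k_recall_truncated_power_law(gamma, k_max, r, k)
    print(f"k = {k:3d}: r_uk = {value:.4f}   (r^gamma = {r ** gamma:.4f})")
```

In this setting the root ($k = 1$) lies above $r^{\gamma}$, the core values are close to $r^{\gamma}$, and the tail near $k_{\max}$ collapses towards zero, which is the "break in" described in the caption of Fig. 5.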



7 Large real-world data sets 

Data sets. We use two large real-world data sets that exhibit power-law char- 
acteristics to verify and illustrate our rules of thumb: the number of links to web 
domains and the keyword distributions in social tagging applications. 

(1) The first data set is a snapshot of a top level domain in the World Wide
Web. It is the result of a complete crawl of the Web and several years old.
The set contains 267,415 domains with 5,422,730 links pointing between them.
From Fig. 6a we see that the redundancy distribution follows a power law with
exponent $\gamma \approx 0.7$ ($\beta \approx 1.7$, $\delta \approx 1.43$) for $k > 100$. Below 100, however, the
distribution considerably diverges from this exponent, which is why we expect
that rule of thumb 1 does not apply well. We now assume random sampling
amongst all links in this data set (e.g. we randomly choose links and discover
new domains) and ask: (i) what is the expected number of domains and their
relative support (as indicated by linking to it) that we learn as a function of the
percentage of links seen? (ii) what is the fraction of domains with support $\ge k$
in the original data that we learn with the same redundancy?

(2) The second data set concerns different keywords and their frequencies
used on the social bookmarking web service Delicious (http://delicious.com).
A total number of ≈140 million tags are recorded of which ≈2.5 million keywords
are distinct [7]. The redundancy distribution (Fig. 6c) follows a power law with
exponent $\gamma \approx 1.3$ ($\beta \approx 2.3$, $\delta \approx 0.77$) very well except for the tail and the very
root. Here we assume random sampling amongst all individual tags given by
users (e.g. we do not have access to the database of Delicious, but rather crawl
the website) and ask: (i) what is the expected number of different tags and their
relative redundancies that we learn as a function of the percentage of all tags seen?
(ii) what is the fraction of important tags in the sample (tags with redundancy
at least $k$) that we can also identify as important by sampling a fraction $r$?



Results. From Fig. 6b and Fig. 6d we see that after sampling 20% of links
and tags, we learn 60% and 40% of the domains and words, respectively.
Hence, our first rule of thumb works well only for the second data set
which better follows a power law. Our second rule of thumb, however, works well
for both data sets: In Fig. 6b we see that, in accordance with our prediction,
the horizontal lines for $r_{uk} = r^{\gamma}$ become apparent for 10^ < k < 10^, and in
Fig. 6d for 10^ < k < $10^4$ (compare with our prediction in Fig. 5d).
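
The sampling procedure behind such an experiment can be sketched on synthetic data. The code below does not use the two data sets of this section; it builds a Zipf-like stand-in in which the piece with rank $i$ has redundancy round(5000/i), draws 20% of the items without replacement, and measures unique recall and k-recall. The data construction, the seed, the constant 5000, and all names are assumptions of this sketch.

```python
import random
from collections import Counter

def measure_k_recalls(data, r, ks, rng):
    """Draw a fraction r of data without replacement and return {k: r_uk}."""
    sample = rng.sample(data, round(r * len(data)))
    data_counts, sample_counts = Counter(data), Counter(sample)
    result = {}
    for k in ks:
        a_uk = sum(1 for count in data_counts.values() if count >= k)
        b_uk = sum(1 for count in sample_counts.values() if count >= k)
        result[k] = b_uk / a_uk
    return result

# Synthetic Zipf-like data (assumption): rank i occurs round(c / i) times.
# Ranks up to 2c - 1 keep a redundancy of at least 1.
c = 5000
data = [i for i in range(1, 2 * c) for _ in range(round(c / i))]

rng = random.Random(0)                    # fixed seed for repeatability
measured = measure_k_recalls(data, 0.2, ks=(1, 10, 20), rng=rng)

print("data items:", len(data), " unique pieces:", len(set(data)))
for k, value in measured.items():
    print(f"k = {k:2d}: measured r_uk = {value:.3f}")
```

For this stand-in, rule of thumb 1 suggests a unique recall of roughly 0.32 (the value derived for a power law in the redundancy distribution with $\delta = 1$), and rule of thumb 2 suggests $r_{uk}$ near $r^{\gamma} = 0.2$ in the core; a single random sample scatters around these values.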



8 Related Work 



Whereas the influence of redundancy on a search process has been widely
analyzed [5,19,25], and randomized sampling used in other papers in this field [11,19],
our approach is new in the way that we analytically characterize the behavior of
the sampling process as a function of (i) the bias in redundancy of the data and
(ii) recall of the used retrieval process. In particular, this approach allows us to
prove hitherto unknown characteristics of power laws during sampling. Achlioptas
et al. [2] give a mathematical model that shows that traceroute sampling from
Poisson-distributed random graphs leads to power laws. Their analysis is limited
to "reasonable" power laws, which are those for which $\alpha > 2$, and also assumes a
very concrete sampling process tailored to their context. This is in contrast with
our result which proves that completely developed power law functions retain
their power law tail, and truncated power laws at least their power law core
during sampling. Haas et al. [15] and Chaudhuri et al. [3] investigate ways to
estimate the number of different attribute values in a given database. This
problem is related in its background but different in its focus. We estimate the
number of unique attributes seen after sampling a fraction and later the overall
sample distribution. Stumpf et al. [26] show that, in general, power laws do not
remain invariant under sampling. In this paper, we could show that, while not
in their entirety, at least the core of power laws remains invariant under
sampling. General balls-and-urn models have been treated in detail by Gardy [14].
Gardy showed a general theorem which contains Prop. 1 as a special case.
However, she neither investigates the behavior of power laws during sampling,
nor extends this result to the evolution of the overall distribution (Prop. 2). Only
the latter allowed us to investigate the overall shape of redundancy distributions
during sampling. Flajolet and Sedgewick [T^ study the evolution of balanced,
single urn models of finite dimensions under random sampling, where
dimensionality refers to the number of colors. Using methods of analytic combinatorics,
they can associate an ordinary differential system of the same dimension to any
balanced urn model, and show that an explicit solution of the differential system
automatically provides an analytic solution of the urn model. They mainly focus on
urn models of dimension 2 (i.e., balls can be of either of two colors), and also
solve some special cases for higher dimensions. They further note that there is
no hope of obtaining general solutions for higher dimensions, but that special
cases warrant further investigation. Using a similar, but slightly different
nomenclature, we also studied a special case of balanced, single urn models, however
with infinite dimension (i.e., infinite number of colors). We further showed that
the case of infinite dimensions allows simple analytic solutions which very closely
represent cases with high dimensionality.

Fig. 6: Original and sample web link distribution (a). Resulting k-recalls
averaged over $N = 100$ repetitions (b). Original and sample tag distribution
on Delicious (c). Resulting k-recalls averaged over $N = 10$ repetitions (d).



In [TS], we gave Prop. 1 and motivated the role of different families of
redundancy distributions on the effectiveness of information acquisition. However,
we did not treat the case of power laws, nor the evolution of distributions
during sampling (Prop. 2). To the best of our knowledge, the main results in this
paper are new. Our analytic treatment of power laws during sampling, the
invariant family, and the proof that sections of power laws remain invariant are
not mentioned in any prior work we are aware of (cf. [10,12,13,14,21,22,24]).



9 Discussion and Outlook 

Our goal in this paper was to develop a general model of the information
acquisition process (retrieval, extraction and integration) that allows us to
estimate the overall success rate when acquiring information from redundant data.
With our model, we derive the 40-20 rule of thumb, an adaptation of the Pareto
principle. This is a negative result as to what can be achieved, in general. A
crucial idea underlying our mathematical treatment of sampling was adopting
a horizontal perspective of sampling and thinking in layers of redundancy
("k-recall"). Whereas our approach assumes an infinite amount of data, we have
shown our approximation holds very well for large data sets (see Fig. 2b). We
have focused on power laws, as they are the dominant form of biased frequency
distributions. Whereas Stumpf et al. [26] have shown that, in general, power laws
do not remain invariant under sampling, we have shown that (i) there exists
one concrete family of power laws which does remain invariant, and (ii) while
power laws do not remain invariant in their tails and root, their core does remain
invariant. We have used this observation to develop a second rule of thumb
which turns out to be very robust (cf. Fig. 5d with Fig. 6d). In future work,
we intend to extend this analytic method to depart from the pure randomized
sampling assumption and incorporate more complicated retrieval processes.
A Nomenclature

$a$                     total data items
$b$                     sampled data items
$r$                     recall or coverage of data $= b/a$
$u$                     subscript for "unique" information contained in data
$a_u$                   total pieces of information
$b_u$                   acquired pieces of information
$r_u$                   unique recall or coverage of information $= b_u/a_u$
$\rho$                  average redundancy $= a/a_u$
$\rho_i$                redundancy of a piece of information with rank $i$
$\boldsymbol{\rho}$     redundancy distribution $= (\rho_1, \ldots, \rho_{a_u})$
$\hat{\ }$              accent for approximation by the limit $= \lim_{a \to \infty} E([\,\cdot\,])$
$\hat{r}_u$             approximate unique recall
$k$                     subscript for information with redundancy $k$
$a_{uk}$                pieces of information with redundancy $\geq k$
$b_{uk}$                pieces of information acquired with redundancy $\geq k$
$k_{max}$               maximum redundancy
$\alpha_k$              fraction of information with redundancy $k$
$\boldsymbol{\alpha}$   redundancy frequency distribution $= (\alpha_1, \ldots, \alpha_{k_{max}})$
$\eta_k$                fraction of information with redundancy $\geq k$
$\boldsymbol{\eta}$     normalized redundancy layers $= (\eta_1, \ldots, \eta_{k_{max}})$
$\omega_k$              evolution of the $k$-th redundancy layer $= b_{uk}/a_u$
$\boldsymbol{\omega}$   vector of sample evolution $= (\omega_1, \ldots, \omega_{k_{max}})$
$r_{uk}$                $k$-recall $= b_{uk}/a_{uk}$
$\boldsymbol{r}_u$      vector of $k$-recall $= (r_{u1}, \ldots, r_{uk_{max}})$
$\Delta_k$              fraction of information with redundancy $= k$ in a sample
$\boldsymbol{\Delta}$   redundancy frequency distribution of sample $= (\Delta_1, \ldots, \Delta_{k_{max}})$



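(a)

                                 Original redundancy                   Sample redundancy                      Relative
                                 distribution                          distribution                           fraction
Fraction with redundancy $= k$   $\alpha_k \in \boldsymbol{\alpha}$    $\Delta_k \in \boldsymbol{\Delta}$     $\theta_k = \Delta_k / \alpha_k$
Fraction with redundancy $\geq k$ $\eta_k \in \boldsymbol{\eta}$       $\omega_k \in \boldsymbol{\omega}$     $r_{uk} = \omega_k / \eta_k$

(b)

Fig. 7: Variables used in this paper (a). Original and sample redundancy
distributions and their ratios (b). Also compare with Fig. 13a and Fig. 13b.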



One basic abstraction used throughout this paper is that of a redundancy
distribution, illustrated with a colored balls and urn model in Fig. 2a. The
vertical axis shows redundancy for each color and the horizontal axis lists colors
in order of decreasing values. The two axes information and redundancy span
the area of data. In short:

data = information × redundancy
B Details Section 3 (unique recall)

B.1 Unique recall for uniform distributions

Here, we first show Prop. 1 for the special case of a uniform redundancy
distribution $\rho = \text{const}$, then treat the general case in Section B.2.



Lemma 3 (Uniform unique recall). Assume randomized sampling from a
data set with uniform redundancy distribution $\boldsymbol{\rho}$ with $\rho_i = \rho$, and let $r$ be the
recall of the underlying data gathering process. Then the asymptotic expectation
of unique recall $r_u$ for large data sets is

$$\lim_{a \to \infty} E(r_u) = \hat{r}_u = 1 - (1 - r)^{\rho} \qquad (1)$$



Proof. Assume an urn filled with balls in different colors (Fig. 8). Each color
appears on exactly $\rho$ different balls, which makes a total number of $a = \rho a_u$
balls. We now randomly draw $b$ balls from the urn without replacement. What
is the average number of different colors $b_u$ we are expected to see?

[Figure: an urn with the $a_u$ colors on the horizontal axis and $\rho$ balls per color on the vertical axis, $\rho a_u = a$ balls in total.]

Fig. 8: An urn filled with $a$ balls in different colors.



When randomly drawing $b$ from $a$ balls, the outcome is any of $\binom{a}{b}$ equally
likely subsets. The number of those subsets in which any given color $i$ does not
appear is the number of possible subsets when choosing $b$ from $a - \rho$ balls, $\binom{a-\rho}{b}$.
Hence, the likelihood that any color $i$ does not appear in a random sample is
the fraction of those numbers, or

$$P[X(i) = 0] = \frac{\text{\# of subsets without } i}{\text{\# of total subsets}} = \frac{\binom{a-\rho}{b}}{\binom{a}{b}} .$$

The likelihood of color $i$ appearing in the sample at least once is the complement
of this fraction, or

$$P[X(i) \geq 1] = 1 - \frac{\binom{a-\rho}{b}}{\binom{a}{b}} . \qquad (2)$$

As this equation holds for each color independently and we can, therefore, treat
the likelihood of appearance for any color as independent events, the expected
number of different colors in the draw is equal to

$$E(b_u) = a_u P[X(i) \geq 1] = a_u \left(1 - \frac{\binom{a-\rho}{b}}{\binom{a}{b}}\right) .$$

Therefore, the expected value of unique recall is

$$E(r_u) = \frac{E(b_u)}{a_u} = 1 - \frac{\binom{a-\rho}{b}}{\binom{a}{b}} . \qquad (3)$$

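Now, let us compute the asymptotic expectation for $a \to \infty$ and $b$ proportional
to $a$. We have

$$\frac{\binom{a-\rho}{b}}{\binom{a}{b}} = \begin{cases} \dfrac{(a-\rho)!}{a!} \cdot \dfrac{(a-b)!}{(a-b-\rho)!} & \text{if } b \leq a - \rho \\ 0 & \text{if } b > a - \rho . \end{cases} \qquad (4)$$

Now note that $a$ and $a - b$ tend to infinity whereas $\rho$ is constant. Thus we have
to analyze expressions like $\frac{(a-\rho)!}{a!}$ asymptotically (then plug $a - b$ instead of $a$
into this formula to get the asymptotic equivalent for the second factor of Eq. 4).
This can be done by Stirling's formula

$$n! = \frac{n^n}{e^n} \sqrt{2 \pi n} \left(1 + \frac{1}{12n} + O\!\left(\frac{1}{n^2}\right)\right), \quad \text{as } n \to \infty .$$

In a few steps we can derive

$$\lim_{a \to \infty} \frac{\binom{a-\rho}{b}}{\binom{a}{b}} = \lim_{a \to \infty} \frac{(a-\rho)!}{a!} \cdot \frac{(a-b)!}{(a-b-\rho)!} = \lim_{a \to \infty} \frac{(a-b)^{\rho}}{a^{\rho}} = (1-r)^{\rho} . \qquad (5)$$

So we have finally

$$\lim_{a \to \infty} E(r_u) = \hat{r}_u = 1 - (1-r)^{\rho} . \qquad \square$$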
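The convergence stated in Lemma 3 can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function names `exact_unique_recall` and `limit_unique_recall` are hypothetical, and the chosen values of $\rho$, $r$ and $a_u$ are only illustrative. It evaluates the exact combinatorial expression of Eq. 3 and compares it with the limit of Eq. 1.

```python
from math import comb

def exact_unique_recall(a_u, rho, b):
    """Exact E(r_u) of Eq. 3 for a_u colors, rho balls per color, b draws."""
    a = a_u * rho                      # total number of balls
    # math.comb returns 0 when b > a - rho, matching the second case of Eq. 4
    return 1 - comb(a - rho, b) / comb(a, b)

def limit_unique_recall(rho, r):
    """Asymptotic unique recall of Eq. 1."""
    return 1 - (1 - r) ** rho

if __name__ == "__main__":
    rho, r = 2, 0.5
    for a_u in (2, 20, 200, 2000):
        b = round(r * a_u * rho)       # sample size for recall r
        print(a_u, exact_unique_recall(a_u, rho, b), limit_unique_recall(rho, r))
```

For the smallest case ($a = 4$, $\rho = 2$, $r = 0.5$) the exact value is $5/6$, while the limit is $0.75$; the gap shrinks as $a_u$ grows.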



B.2 Unique recall for general distributions 

The previous approach of deriving the basic unique recall formula allows us to
treat general redundancy distributions as well. Only now, each color $i$ appears
with redundancy $\rho(i)$ or $\rho_i$ (Fig. 9).

Proof (Prop. 1). From Eq. 2 we know that the likelihood of color $i$ appearing
in a random sample of size $b$ at least once is

$$P[X(i) \geq 1] = 1 - \frac{\binom{a-\rho_i}{b}}{\binom{a}{b}} .$$

[Figure: an urn with pieces of information $1, \ldots, a_u$ on the horizontal axis and redundancy $\rho_i$ on the vertical axis.]

Fig. 9: An urn filled with balls of $a_u$ different colors with varying redundancy
$\rho(i)$ for each color $i$.

This equation again holds independently for each color $i \in [a_u]$, which allows us
to simply add the likelihoods of all colors and calculate the expected number of
different colors $b_u$ in the draw as

$$E(b_u) = \sum_{i=1}^{a_u} P[X(i) \geq 1] = a_u - \sum_{i=1}^{a_u} \frac{\binom{a-\rho_i}{b}}{\binom{a}{b}} .$$

Therefore, we get as the exact combinatorial expected value of $r_u$

$$E(r_u) = \frac{E(b_u)}{a_u} = 1 - \frac{1}{a_u} \sum_{i=1}^{a_u} \frac{\binom{a-\rho_i}{b}}{\binom{a}{b}} . \qquad (6)$$

As we know from our previous limit consideration (Eq. 5),

$$\lim_{a \to \infty} \frac{\binom{a-\rho_i}{b}}{\binom{a}{b}} = (1-r)^{\rho_i} ,$$

the exact equation can be simplified for large data sets ($a \to \infty$) to

$$\lim_{a \to \infty} E(r_u) = 1 - \frac{1}{a_u} \sum_{i=1}^{a_u} (1-r)^{\rho_i} . \qquad (7)$$

Whereas the latter formula is much simpler to evaluate for a given redundancy
distribution, the bias of redundancy is still described by the exact distribution
of individual data items $\boldsymbol{\rho} = (\rho_1, \rho_2, \ldots, \rho_{a_u})$. However, in a new step we
can transform it further to

$$\lim_{a \to \infty} E(r_u) = \hat{r}_u = 1 - \sum_{k=1}^{k_{max}} \alpha_k (1-r)^k , \qquad (8)$$

with $\alpha_k$ standing for the fraction of information with redundancy $k$, $k_{max}$ being
the maximum occurring redundancy $\rho_1$, and the fractions summing up to 1,
$\sum_{k=1}^{k_{max}} \alpha_k = 1$.

The variance $\mathrm{Var}(b_u) = E(b_u^2) - (E(b_u))^2$ can be calculated by similar but
a bit more intricate calculations.

From that, it can be shown that 

Var (r„) ^ ^ (1 - r)" (^1 - (1 - r)" (^1 + ^ j ^ (9) 

which tends to 0, as $a \to \infty$. This means that the random variable $r_u$ is
asymptotically concentrated around its mean value. $\square$

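Note, we can calculate the vector $\boldsymbol{\alpha} = (\alpha_k)$ from $\boldsymbol{\rho}$ as

$$\alpha_k = \frac{\left|\{\, i \mid \rho(i) = k,\ i \in \mathbb{N}^{a_u},\ \boldsymbol{\rho} = (\rho_i) \,\}\right|}{a_u} ,$$

with $k \in \mathbb{N}^{k_{max}}$, $k_{max} = \rho_1 = \max(\boldsymbol{\rho})$, and $a_u = \dim(\boldsymbol{\rho})$ being the number of
different data items. The vector $\boldsymbol{\alpha}$ presents an alternative description of bias in
redundancy of data (Fig. 2). However, it is not an equivalent description without
$a_u$ explicitly stated, which we see by calculating $\boldsymbol{\rho}$ back from $\boldsymbol{\alpha}$ by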

femax 

=min|/c| 2 ^ — ''^ € N^""'^a= K)} , (10) 
L I a^ J 

x—k 

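with $i \in \mathbb{N}^{a_u}$, $k_{max} = \dim(\boldsymbol{\alpha})$, and the number of total pieces of information $a_u$
not explicitly given by $\boldsymbol{\alpha}$. More formally, we can state for two variations of the
mapping:

$f : \boldsymbol{\rho} \to \boldsymbol{\alpha}$ : non-injective mapping
$f : \boldsymbol{\rho} \to (\boldsymbol{\alpha}, a_u)$ : injective mapping .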


B.3 Further observations 



Illustration of the limit value. Figure 3a illustrates with three example
values that the basic unique recall formula poses a good approximation for Eq. 3
as exact combinatorial solution of $E(r_u)$ for $a \to \infty$, with $a$ being the total
number of redundant data items. Figure 3b in addition plots the 5th and 95th
percentile of the random variable $r_u$. For this plot, we randomly sampled and
averaged over 1000 times for each data point. Given any pair of values $\rho$ and $r$,
only certain combinations of $a$, $b$, $a_u$ and $b_u$, and thus, certain values of $r_u$ are
possible. As a consequence, the resulting percentile graphs are ragged. Finally,
Figure 10 illustrates with our running example (Fig. 2) that $\hat{r}_u$ is a good
approximation not only for $E(r_u)$, but also for $E(b_u) = a_u E(r_u)$. Even the absolute error
$\Delta = E(b_u) - a_u \hat{r}_u$ generally decreases with the size of the data set.






                      $\boldsymbol{\rho}_1 = (6,3,3,2,1)$               $\boldsymbol{\rho}_2 = (6,6,3,3,3,3,2,2,1,1)$
                      $a_u = 5$, $a = 15$                  $a_u = 10$, $a = 30$
$r$     $\hat{r}_u$   $b$   $E(b_u)$   $a_u \hat{r}_u$   $\Delta$     $b$   $E(b_u)$   $a_u \hat{r}_u$   $\Delta$
0.2     0.455         3     2.420      2.274             0.146        6     4.684      4.548             0.136
0.4     0.712         6     3.671      3.561             0.110        12    7.230      7.123             0.107
0.6     0.862         9     4.369      4.308             0.061        18    8.677      8.616             0.061
0.8     0.949         12    4.771      4.744             0.027        24    9.511      9.488             0.023

(c)

Fig. 10: Comparing $E(b_u(\boldsymbol{\rho}, r))$, the exact solution for the expected number
of pieces of information learned, for $\boldsymbol{\rho}_1$ and $\boldsymbol{\rho}_2$ with the approximate
solution $E(b_u) \approx a_u \hat{r}_u(\boldsymbol{\alpha}, r)$ shows that Prop. 1 is a good approximation
of the exact solution Eq. 6 for large data sets ($a \to \infty$): For constant
$\boldsymbol{\alpha} = (\frac{1}{5}, \frac{1}{5}, \frac{2}{5}, 0, 0, \frac{1}{5})$, not only the error of expected unique recall $E(r_u) - \hat{r}_u$,
but also the error in the expected number of learned pieces of information
$\Delta = E(b_u) - a_u \hat{r}_u$ decreases, in general, with increasing $a$.
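The entries of the table in Fig. 10 can be recomputed from Eq. 6 and Eq. 8. The sketch below is an illustration under our own naming (the helpers `exact_expected_unique` and `approx_unique_recall` are hypothetical); it takes a redundancy distribution as a tuple of per-item redundancies.

```python
from collections import Counter
from math import comb

def exact_expected_unique(rho, b):
    """Exact E(b_u): sum over colors of P[X(i) >= 1] (Eq. 2 with rho_i)."""
    a = sum(rho)                       # total number of data items
    return sum(1 - comb(a - p, b) / comb(a, b) for p in rho)

def approx_unique_recall(rho, r):
    """Approximate unique recall of Eq. 8, with alpha_k derived from rho."""
    a_u = len(rho)
    alpha = {k: n / a_u for k, n in Counter(rho).items()}
    return 1 - sum(w * (1 - r) ** k for k, w in alpha.items())

if __name__ == "__main__":
    rho_1 = (6, 3, 3, 2, 1)
    rho_2 = (6, 6, 3, 3, 3, 3, 2, 2, 1, 1)
    for r in (0.2, 0.4, 0.6, 0.8):
        for rho in (rho_1, rho_2):
            b = round(r * sum(rho))
            exact = exact_expected_unique(rho, b)
            approx = len(rho) * approx_unique_recall(rho, r)
            print(r, b, round(exact, 3), round(approx, 3), round(exact - approx, 3))
```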



Analogy. On a side note, the following problem presents an interesting
mathematical analogy: Assume that a web crawler finds each available online copy
of a research paper with probability $\gamma$. The probability of missing a document
is $1 - \gamma$, the probability of missing a document with $c$ copies online $(1 - \gamma)^c$,
and, hence, the probability of finding and indexing a document with $c$ copies
online $1 - (1 - \gamma)^c$ [23]. Though the nature of the solution is the same as the
one to our problem and both problems seem to be identical at first sight, the
underlying question is different from asking how many different documents one
could retrieve on average. The reason is that expectations cannot be added in
the presence of mutual correlations. Looking at one particular document has an
exact solution which is always true: $1 - (1 - \gamma)^c$. The exact answer to our problem
is Eq. 3, which only approaches Eq. 1 when taking the limit.
The difference is best illustrated with the first few red dots in Fig. 3a.
Expected unique recall is 0.83 for $a = 4$, $\rho = 2$, $r = 0.5$ and $a_u = 2$ different
documents, and not 0.75.



Geometric interpretation. Equation 8 also allows an interesting geometric
interpretation of unique recall for a general redundancy distribution as the mean
of all unique recalls $\hat{r}_u(k, r)$ for uniform redundancies $k \leq k_{max}$, weighted by their
fractions $\alpha_k$:

$$\hat{r}_u = \sum_{k=1}^{k_{max}} \alpha_k \, \hat{r}_u(k, r) .$$

The intuition why this formula must hold is the same as why stratification in
statistics does not change the expected outcome of a sampling process. In stratified
sampling, first, a population to be sampled is grouped into MECE (mutually
exclusive, collectively exhaustive) subgroups and then a fraction is sampled from
each stratum that is proportional to its relative size [21]. The mathematical
justification is that, on average, the fractions sampled in a random draw are
the same across all strata and the total population. For the same reason, when
sampling a fraction $r$ of the total amount of data, $r$ will also be the expected
fraction that is sampled from each subset or stratum with constant redundancy.

Hence, we get back Eq. 8 for the formula of unique recall $\hat{r}_u(\boldsymbol{\alpha}, r)$ of a general
redundancy distribution $\boldsymbol{\alpha}$:

$$\hat{r}_u(\boldsymbol{\alpha}, r) = \sum_{k=1}^{k_{max}} \alpha_k \, \hat{r}_u(k, r) = \sum_{k=1}^{k_{max}} \alpha_k \left(1 - (1-r)^k\right) = 1 - \sum_{k=1}^{k_{max}} \alpha_k (1-r)^k .$$

We will use this possibility to average over fractions with constant redundancy
again in Section D when we calculate the evolution of the general redundancy
distribution.
C Details Section 4 (unique recall for power laws)

C.1 Power laws in the redundancy distribution

Here, we consider the case when the frequency or redundancy $\rho$ of an item is
proportional to a power law with exponent $\delta$ of its rank $i$:

$$\rho(i) \propto i^{-\delta} , \quad i \in [a_u] .$$

One often cited example of this distribution with $\delta \approx 1$ (Zipf distribution) is the
frequency rank distribution of words appearing in arbitrary corpora [27].
In the normalized redundancy distribution, this power law translates into

$$\rho(\eta) \propto \eta^{-\delta} , \quad \eta \in [0, 1] ,$$

whereby the above two relations could only hold closely if real values were
possible for $\rho(i)$ and $\rho(\eta)$. If we assume some underlying continuous process that is
responsible for the observed discrete power law, then the natural way to model
the above relation is by rounding to the nearest possible redundancy $k \in \mathbb{N}_0$,

$$k(\eta) = \mathrm{round}(\rho(\eta)) = \mathrm{round}(C \cdot \eta^{-\delta}) = \lfloor C \cdot \eta^{-\delta} + 0.5 \rfloor .$$

The last step uses the floor function $\lfloor x \rfloor$ to describe the greatest integer less than or
equal to $x$, which, in the next step, helps us calculate the fraction of information
$\eta_k$ that appears at least $k$ times. As we only consider positive integers for $k$, we
can leave away the floor function when expressing $\eta_k = \eta(k)$ and get

$$k = C \cdot \eta_k^{-\delta} + 0.5$$

or

$$\eta_k = \left(\frac{k - 0.5}{C}\right)^{-\frac{1}{\delta}} .$$

As the fraction of information that appears at least one time is equal to 1 and,
thus, $\eta_1 = 1$, we have $C = 0.5$. So we get

$$\eta_k = (2k - 1)^{-\frac{1}{\delta}} .$$

From their definitions in Section 2 we know that $\alpha_k = \eta_k - \eta_{k+1}$ and, hence,

$$\alpha_k = (2k - 1)^{-\frac{1}{\delta}} - (2k + 1)^{-\frac{1}{\delta}} . \qquad (11)$$

Then, from Prop. 1 we know $\hat{r}_u(r) = 1 - \sum_{k=1}^{\infty} \alpha_k (1-r)^k$ and can now state
approximate unique recall for a redundancy distribution that follows a power
law with exponent $\delta$ as

$$\hat{r}_u(r) = 1 - \sum_{k=1}^{\infty} \left((2k-1)^{-\frac{1}{\delta}} - (2k+1)^{-\frac{1}{\delta}}\right) (1-r)^k .$$
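This infinite sum cannot be reduced in general. However, it does have a closed
solution for $\delta = 1$. To see this, we substitute with $x = \sqrt{1-r}$ to get

$$\hat{r}_u(x) = 1 - \sum_{k=1}^{\infty} \left((2k-1)^{-\frac{1}{\delta}} - (2k+1)^{-\frac{1}{\delta}}\right) x^{2k}$$
$$= 1 - \sum_{k=1}^{\infty} (2k-1)^{-\frac{1}{\delta}} x^{2k} + \sum_{k=1}^{\infty} (2k+1)^{-\frac{1}{\delta}} x^{2k}$$
$$= 1 - x \sum_{k=1}^{\infty} (2k-1)^{-\frac{1}{\delta}} x^{2k-1} + \frac{1}{x} \sum_{k=1}^{\infty} (2k+1)^{-\frac{1}{\delta}} x^{2k+1}$$
$$= 1 - x \sum_{k=1}^{\infty} (2k-1)^{-\frac{1}{\delta}} x^{2k-1} + \frac{1}{x} \left(\sum_{k=1}^{\infty} (2k-1)^{-\frac{1}{\delta}} x^{2k-1} - x\right)$$
$$= \frac{1 - x^2}{x} \underbrace{\sum_{k=1}^{\infty} (2k-1)^{-\frac{1}{\delta}} x^{2k-1}}_{S} .$$

It is interesting to observe the relation of the power series $S$ to the polylogarithm:
$S$ consists just of the odd terms from the power series of the polylogarithm. To
the best knowledge of the author, there is no generally known function defined for
this series nor a way to reformulate it as a function of other basic and generally
known functions defined in mathematics. For $\delta = 1$, however, $S$ is simply the
power series of $\tanh^{-1}(x) = \mathrm{artanh}(x)$, the inverse hyperbolic tangent [H p. 484]
as

$$\mathrm{artanh}(x) = \sum_{k=1}^{\infty} \frac{x^{2k-1}}{2k-1} .$$

Thus, analogous to the polylogarithm being a generalization of the logarithm
with $\mathrm{Li}_1(x) = -\ln(1-x)$, $S$ could be considered a similar extension to $\mathrm{artanh}(x)$.
Not being defined as such and, therefore, most likely not commonly found in
mathematics, we can reformulate unique recall at least for $\delta = 1$ as

$$\hat{r}_u(x) = \frac{1 - x^2}{x} \, \mathrm{artanh}(x) .$$

Resubstituting $\sqrt{1-r}$ for $x$, we finally get

$$\hat{r}_u(r) = \frac{r}{\sqrt{1-r}} \, \mathrm{artanh}(\sqrt{1-r}) .$$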



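C.2 Power laws in the redundancy frequency plot

Next, we assume that the fraction of information that appears exactly $k$ times
follows a power law

$$\alpha_k = C \cdot k^{-\beta} , \quad k \in \mathbb{N}_1 .$$

Here, we use $\beta$ as exponent when $\alpha_k$ follows a power law to distinguish this case
from the previous one where $\rho(\eta)$ followed a power law with exponent $\delta$. From
$k = 1$, we see that the constant of proportionality $C$ is equal to $\alpha_1$ and, hence,

$$\alpha_k = \alpha_1 k^{-\beta} .$$

Using the normalizing condition $\sum_{k=1}^{\infty} \alpha_k = 1$, we get

$$\sum_{k=1}^{\infty} k^{-\beta} = \frac{1}{\alpha_1} .$$

The infinite sum on the left side is known in mathematics as the Riemann zeta
function [JT, p. 263],

$$\zeta(\beta) = \sum_{k=1}^{\infty} k^{-\beta} ,$$

which allows us to state

$$\alpha_1 = \frac{1}{\zeta(\beta)}$$

and further

$$\alpha_k = \frac{k^{-\beta}}{\zeta(\beta)} .$$

From their definitions in Section 2 we know that the fraction of information $\eta_k$
that appears at least $k$ times (or has redundancy $\geq k$) is

$$\eta_k = 1 - \sum_{x=1}^{k-1} \alpha_x = 1 - \frac{1}{\zeta(\beta)} \sum_{x=1}^{k-1} x^{-\beta} .$$

The series on the right side is known as the generalized harmonic number of
order $(k-1)$ of $\beta$. The generalized harmonic number $H_k^{(\beta)}$ of order $k$ of $\beta$ [20,
p. 74] is defined as

$$H_k^{(\beta)} = \sum_{x=1}^{k} x^{-\beta} ,$$

which, for $k \to \infty$, is equal to the Riemann zeta function:

$$H_{\infty}^{(\beta)} = \sum_{x=1}^{\infty} x^{-\beta} = \zeta(\beta) .$$

We, therefore, have

$$\eta_k = 1 - \frac{H_{k-1}^{(\beta)}}{\zeta(\beta)} .$$

Then, from Prop. 1 we know $\hat{r}_u(r) = 1 - \sum_{k=1}^{\infty} \alpha_k (1-r)^k$ and get as
approximate unique recall

$$\hat{r}_u(r) = 1 - \frac{1}{\zeta(\beta)} \sum_{k=1}^{\infty} k^{-\beta} (1-r)^k .$$

The infinite series on the right side is known in mathematics as the polylogarithm.
The polylogarithm $\mathrm{Li}_z(x)$ is defined as

$$\mathrm{Li}_z(x) = \sum_{k=1}^{\infty} \frac{x^k}{k^z} ,$$

which, for $x = 1$, is again equal to the Riemann zeta function:

$$\mathrm{Li}_z(1) = \sum_{k=1}^{\infty} k^{-z} = \zeta(z) .$$

We can, therefore, write unique recall for a redundancy distribution where the
redundancy frequencies follow a power law with exponent $\beta$ as

$$\hat{r}_u(r) = 1 - \frac{\mathrm{Li}_{\beta}(1-r)}{\zeta(\beta)} .$$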

C.3 Power laws in the redundancy layers

Here, we assume that the redundancy layers $\eta_k \in [0, 1]$ follow a power law

$$\eta_k \propto k^{-\gamma} .$$

As $\eta_1 = 1$ (the first layer must always be 1), we can directly write

$$\eta_k = k^{-\gamma} .$$

From $\alpha_k = \eta_k - \eta_{k+1}$ we know the fraction of information that appears exactly
with redundancy $k$ to be

$$\alpha_k = k^{-\gamma} - (k+1)^{-\gamma} , \qquad (12)$$

and from $\hat{r}_u(r) = 1 - \sum_{k=1}^{\infty} \alpha_k (1-r)^k$ we get approximate unique recall as

$$\hat{r}_u(r) = 1 - \sum_{k=1}^{\infty} \left(k^{-\gamma} - (k+1)^{-\gamma}\right) (1-r)^k .$$

Like in the previous subsection, the infinite series can be expressed by the
polylogarithm; this time, however, after some transformations:

$$\hat{r}_u(r) = 1 - \sum_{k=1}^{\infty} k^{-\gamma} (1-r)^k + \sum_{k=1}^{\infty} (k+1)^{-\gamma} (1-r)^k$$
$$= 1 - \sum_{k=1}^{\infty} k^{-\gamma} (1-r)^k + \frac{1}{1-r} \sum_{k=1}^{\infty} (k+1)^{-\gamma} (1-r)^{k+1}$$
$$= 1 - \sum_{k=1}^{\infty} k^{-\gamma} (1-r)^k + \frac{1}{1-r} \sum_{k=2}^{\infty} k^{-\gamma} (1-r)^k$$
$$= 1 - \sum_{k=1}^{\infty} k^{-\gamma} (1-r)^k + \frac{1}{1-r} \sum_{k=1}^{\infty} k^{-\gamma} (1-r)^k - 1$$
$$= \frac{r}{1-r} \sum_{k=1}^{\infty} k^{-\gamma} (1-r)^k .$$

Using the definition for the polylogarithm, we learn unique recall for a
redundancy distribution where the redundancy layers $\eta_k$ follow a power law with
exponent $\gamma$ as

$$\hat{r}_u(r) = \frac{r}{1-r} \, \mathrm{Li}_{\gamma}(1-r) .$$

For the special case of $\gamma = 1$, we can use the property $\mathrm{Li}_1(x) = -\ln(1-x)$
and simplify unique recall as

$$\hat{r}_u(r) = -\frac{r \ln r}{1-r} .$$
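The three unique recall formulas of Sections C.1 to C.3 can be evaluated numerically by truncating the series. The sketch below is only an illustration: all function names are hypothetical, the polylogarithm is summed directly (which converges for $0 < r \leq 1$), and $\zeta(\beta)$ is approximated by a partial sum with a standard Euler-Maclaurin tail correction, which is our choice and not part of the derivation above.

```python
import math

def polylog(s, x, terms=5000):
    """Truncated power series of Li_s(x) for 0 <= x < 1."""
    return sum(x ** k / k ** s for k in range(1, terms + 1))

def zeta(s, n=10000):
    """Riemann zeta for s > 1: partial sum plus Euler-Maclaurin tail terms."""
    head = sum(k ** -s for k in range(1, n + 1))
    return head + n ** (1 - s) / (s - 1) - 0.5 * n ** -s

def recall_rank_power_law(r, delta, terms=5000):
    """C.1: rho(i) follows a power law with exponent delta (series after Eq. 11)."""
    e = -1.0 / delta
    return 1 - sum(((2 * k - 1) ** e - (2 * k + 1) ** e) * (1 - r) ** k
                   for k in range(1, terms + 1))

def recall_frequency_power_law(r, beta):
    """C.2: alpha_k follows a power law with exponent beta > 1."""
    return 1 - polylog(beta, 1 - r) / zeta(beta)

def recall_layer_power_law(r, gamma):
    """C.3: eta_k follows a power law with exponent gamma."""
    return r / (1 - r) * polylog(gamma, 1 - r)

if __name__ == "__main__":
    r = 0.2
    print(recall_rank_power_law(r, 1.0))       # Zipf case, delta = 1
    print(recall_frequency_power_law(r, 2.0))  # beta = 2
    print(recall_layer_power_law(r, 1.0))      # gamma = 1
```

For $\delta = 1$ and $\gamma = 1$ the truncated series can be compared against the closed forms $\frac{r}{\sqrt{1-r}}\,\mathrm{artanh}(\sqrt{1-r})$ and $-\frac{r \ln r}{1-r}$ derived above.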

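C.4 Comparing the power law tails

All three power laws show the typical power law straight line in the loglog
redundancy frequency plot for their tails (Fig. 4a). The exponents can be calculated
from each other as follows. We first calculate $\gamma = \gamma(\beta)$. From the binomial
theorem, we can expand

$$(k+1)^{-\gamma} = k^{-\gamma} - \gamma k^{-\gamma-1} + \frac{\gamma(\gamma+1)}{2!} k^{-\gamma-2} - \ldots$$

Applying this formula to Eq. 12, we get

$$\alpha_k = k^{-\gamma} - (k+1)^{-\gamma} = \gamma k^{-\gamma-1} - \frac{\gamma(\gamma+1)}{2!} k^{-\gamma-2} + \ldots$$

Hence, we can calculate $\beta$ from $\gamma$ by

$$\beta = \gamma + 1 .$$

We next calculate $\delta = \delta(\beta)$. Again using the binomial theorem, this time on
$(2k \pm 1)^{-\frac{1}{\delta}}$, we get

$$(2k \pm 1)^{-\frac{1}{\delta}} = (2k)^{-\frac{1}{\delta}} \pm \binom{-\frac{1}{\delta}}{1} (2k)^{-\frac{1}{\delta}-1} + \binom{-\frac{1}{\delta}}{2} (2k)^{-\frac{1}{\delta}-2} \pm \ldots$$

Applying this formula to Eq. 11, we get

$$\alpha_k = (2k-1)^{-\frac{1}{\delta}} - (2k+1)^{-\frac{1}{\delta}} = -2 \left(-\frac{1}{\delta}\right) (2k)^{-\frac{1}{\delta}-1} - \ldots$$

and, hence,

$$\beta = \frac{1}{\delta} + 1 .$$

Figure 4e shows the relations between the individual power law exponents.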



D Details Section 5 (evolution of redundancy distributions)

D.1 Evolution of the uniform distribution

We develop a geometric model of stratified sampling to first deduce the evolution
of the uniform distribution $\rho = \text{const}$. We will then generalize this approach in
Section D.2 and prove Prop. 2.

We assume the total amount of information to be large. Therefore, our focus
can shift from individual pieces of information to fractions of uncountable
information where each piece of evidence is infinitesimal. Without loss of
generality we set the total amount of unique information to 1. As all information
has redundancy $\rho$, the total amount of data is then $\rho$ and we can depict this
uniform redundancy distribution as a stack of $\rho$ layers of the same unique
information (Fig. 11a). We consider a random sampling process of fraction $r$ in
such a way that, before sampling, we divide the population into different
sub-populations or strata ('strata' means 'layers'), and then take samples from all
sub-populations in proportion to their relative sizes. In statistics, this process is
known as stratification, the process of grouping members of the population into
relatively homogeneous and MECE (mutually exclusive, collectively exhaustive)
subgroups and then proportional allocation of sample sizes to the subgroups [53].
As, on average, the fractions sampled in a random draw are the same across all
strata and the total population, stratification does not change the expected
outcome of a sampling process. In our case, where we consider the limit case of very
large data sets with $a \to \infty$ data points, this step of subdividing populations
and then sampling in proportion to their sizes can be repeated arbitrarily often
and does not change the expected outcome of the overall sampling process.

Fig. 11: (a): Geometric interpretation of sampling from a normalized
uniform redundancy distribution: Sampling happens from layer 1 to layer $\rho$,
one after the other. In each layer, existing divisions are further divided into
two fractions of size proportional to $r$ (grey) and $1 - r$ (white). (b): The
process of building the divisions from one layer to the next can be compared to
an upside-down tree where at each node going left happens with probability
$r$ and right with probability $1 - r$.

We start bottom up from layer 1 to layer $\rho$, and at each layer further divide
all existing divisions from previous layers into two parts: one of relative size $r$
from which we sample (grey) and one of relative size $1 - r$ from which we do
not sample (white). At the first layer, we divide into two strata: a fraction $r$
of sampled information and a fraction $1 - r$ of unsampled information. In the
second layer, we first divide the total amount of information into the same two
groups: information already seen in the first layer, of size $r$, and a second group
with information not yet seen, of size $1 - r$. Then we choose samples from both
subgroups of proportion $r$ of their sizes, thus getting one fraction of size $r^2$ of
twice seen information, two strata with size $r(1-r)$ of once seen information,
and one stratum with size $(1-r)^2$ of not yet sampled information in either layer.
Iteratively repeating this process, we have $2^k$ divisions of the total amount of
information at any layer $k$. The formation of the divisions in the highest layer
$\rho$ can be imagined by the growth of a tree where each division is connected to
the division in the previous layer from which it originated (Fig. 11b). The size
of each division depends on the number of times it was created by choosing the
sample option with proportion $r$ or the not-sample option with size $1 - r$. As
an example, the arrows in Fig. 11b point to those 4 divisions in the fourth layer
which represent one time sampling and three times non-sampling and which
are, therefore, of size $r(1-r)^3$. More generally, the size of the divisions in layer
$\rho$ representing $k$ times sampling and $\rho - k$ times non-sampling is $r^k (1-r)^{\rho-k}$.
The number of such divisions is equal to the number of ways that $k$ objects
can be chosen from among $\rho$ objects, regardless of order, which is equal to the
binomial coefficient $\binom{\rho}{k}$. Multiplying these numbers, we get as result that the
expected fraction of information that appears with redundancy $k$ in a sample
from a uniform redundancy distribution with redundancy $\rho$ is equal to the
binomial distribution $\binom{\rho}{k} r^k (1-r)^{\rho-k}$, which is the probability of getting exactly
$k$ successes in $\rho$ independent yes/no experiments, each of which yields success
with probability $r$:



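$$\Delta_k(k, \rho, r) = \binom{\rho}{k} r^k (1-r)^{\rho-k} .$$

Note that $\Delta_k(k, \rho, r)$ is actually defined for all $\rho \in \mathbb{N}_1$, $k \in \mathbb{N}_0$ and $r \in [0, 1]$:

$$\Delta_k(k, \rho, r) = \begin{cases} \binom{\rho}{k} r^k (1-r)^{\rho-k} & \text{if } 0 < k < \rho \\ r^{\rho} & \text{if } 0 < k = \rho \\ (1-r)^{\rho} & \text{if } 0 = k < \rho \\ 0 & \text{if } k > \rho . \end{cases}$$

The evolution $\omega_k$ can then be simply calculated from $\omega_k = 1 - \sum_{y=0}^{k-1} \Delta_y$:

$$\omega_k(k, \rho, r) = 1 - \sum_{y=0}^{k-1} \binom{\rho}{y} r^y (1-r)^{\rho-y} . \qquad (13)$$

Figure 12 illustrates this result.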
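As a small numerical illustration of Eq. 13 (a sketch with hypothetical function names `delta_uniform` and `omega_uniform`, not code from the paper), the binomial fractions and the resulting layer evolution can be computed directly:

```python
from math import comb

def delta_uniform(k, rho, r):
    """Expected fraction with redundancy exactly k after sampling a fraction r
    from a uniform redundancy distribution with redundancy rho."""
    if k > rho:
        return 0.0
    return comb(rho, k) * r ** k * (1 - r) ** (rho - k)

def omega_uniform(k, rho, r):
    """Evolution of the k-th redundancy layer (Eq. 13)."""
    return 1 - sum(delta_uniform(y, rho, r) for y in range(k))

if __name__ == "__main__":
    rho, r = 4, 0.2
    print([round(delta_uniform(k, rho, r), 4) for k in range(rho + 1)])
    print([round(omega_uniform(k, rho, r), 4) for k in range(1, rho + 1)])
```

The first layer $\omega_1 = 1 - (1-r)^{\rho}$ coincides with the uniform unique recall of Lemma 3, and $\Delta_1 = 4 r (1-r)^3$ for $\rho = 4$ matches the four divisions of size $r(1-r)^3$ discussed above.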



D.2 Proof Prop. 2 (evolution of general distributions)

Proof. We again use stratification and divide the overall redundancy distribution
into homogeneous blocks with constant redundancy $x$ and unique information $\alpha_x$
before sampling a fraction $r$ from each block in turn (Fig. 13b). The
mathematical justification is that, on average, the fractions sampled in a random draw are
the same across all sub-populations and the total population.

[Figure: panels (a) and (b), horizontal axes: Information $\eta$.]

Fig. 13: The evolution $\omega_k$ of the $k$-th redundancy layer of a general
redundancy distribution $\boldsymbol{\alpha}$ is the weighted mean of the evolution of this layer
in all uniform distributions.

We know that sampling from each block with constant redundancy $x$ follows
the previously established relationship of evolution of the uniform distribution (Eq. 13). The
amount of information with redundancy $k$ is then the base of the block times
$\Delta_k(k, x, r)$. At the same time, the total amount of information $\Delta_k(k, \boldsymbol{\alpha}, r)$ with
redundancy $k$ is equal to the sum of $\alpha_x \Delta_k(k, x, r)$ in each block. Hence, the
evolution of a general redundancy distribution is the mean of the evolution of
all uniform distributions $x \leq k_{max}$, weighted by their fractions $\alpha_x$:

$$\Delta_k(k, \boldsymbol{\alpha}, r) = \sum_{x=1}^{k_{max}} \alpha_x \Delta_k(k, x, r) = \sum_{x=k}^{k_{max}} \alpha_x \binom{x}{k} r^k (1-r)^{x-k} .$$

The evolution $\omega(k, \boldsymbol{\alpha}, r)$ can again be simply calculated stepwise from
$\Delta(k, \boldsymbol{\alpha}, r)$ by $\omega_k = \omega_{k-1} - \Delta_{k-1}$ with $\omega_1 = 1 - \Delta_0$ as

$$\omega_k = 1 - \sum_{y=0}^{k-1} \Delta_y = 1 - \sum_{y=0}^{k-1} \sum_{x=y}^{k_{max}} \alpha_x \binom{x}{y} r^y (1-r)^{x-y} .$$

Further, $k$-recall is then given by

$$r_{uk} = \frac{\omega_k}{\eta_k} . \qquad \square$$
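The weighted-mean construction of this proof translates directly into a few lines of code. The following sketch uses hypothetical function names and represents $\boldsymbol{\alpha}$ as a sequence whose entry `alpha[x - 1]` is the fraction of information with redundancy $x$; it is meant as an illustration of the formulas above, not as the authors' implementation.

```python
from math import comb

def delta_general(k, alpha, r):
    """Fraction of information with redundancy exactly k in the sample:
    weighted sum of the binomial terms over all blocks x >= k."""
    return sum(a_x * comb(x, k) * r ** k * (1 - r) ** (x - k)
               for x, a_x in enumerate(alpha, start=1) if x >= k)

def omega_general(k, alpha, r):
    """Evolution of the k-th redundancy layer: 1 minus all fractions below k."""
    return 1 - sum(delta_general(y, alpha, r) for y in range(k))

def k_recall(k, alpha, r):
    """k-recall r_uk = omega_k / eta_k, with eta_k the fraction of redundancy >= k."""
    eta_k = sum(alpha[k - 1:])
    return omega_general(k, alpha, r) / eta_k

if __name__ == "__main__":
    alpha = (0.2, 0.2, 0.4, 0.0, 0.0, 0.2)   # running example of the paper
    for k in range(1, 4):
        print(k, omega_general(k, alpha, 0.5), k_recall(k, alpha, 0.5))
```

For $k = 1$ the result equals the approximate unique recall of Eq. 8, since $\eta_1 = 1$ and $\omega_1 = 1 - \Delta_0$.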
D.3 Illustration of the limit value of Prop. 2

Figure 14 illustrates that the mathematics developed indeed predicts sampling
from large data sets very well. In grey, the individual frames show the horizontal
evolution of the example normalized redundancy distribution $\boldsymbol{\alpha} = (\frac{1}{5}, \frac{1}{5}, \frac{2}{5}, 0, 0, \frac{1}{5})$
for $r = 0.5$. In red, they show the actual expected vertical redundancy
distributions as given by Monte Carlo simulations, where the total number of pieces
of information $a_u \propto a$ increases from 5 in Fig. 14a to 1000 in Fig. 14d. Note
how the horizontal layers of redundancy become visible in the vertical
perspective with increasing $a$. These layers become even more apparent in the sample
redundancy distribution when taking the median instead of the mean of
individual draws in the Monte Carlo simulations (we show here the mean). This
observation suggests that the normalized, horizontal perspective is actually the
inherently natural perspective to analyze sampling from large data sets.

Fig. 14: Comparing the expected evolution of the redundancy layers $r_{uk}$
(grey) with the expected sample redundancy distributions $\rho_i$ given by
Monte Carlo simulations (red): For large data sets ($a_u \to \infty$), Prop. 2
predicts the sampled distributions increasingly well and the horizontal
redundancy layers actually become visible in the vertical sample distributions.
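E Details Section 6 (evolution of power laws)

E.1 Proof Lemma 1 (evolution of the invariant)

Proof. We prove that the redundancy distribution $\boldsymbol{\alpha}_I = (\alpha_k)$ with

$$\alpha_k = (-1)^{k-1} \binom{\tau}{k} \quad \text{and} \quad 0 < \tau < 1$$

is invariant under sampling. We show this by proving that the following holds
for $\boldsymbol{\alpha}_I$:

$$r_{uk}(k, \boldsymbol{\alpha}, r) = r^{\tau} .$$

The statement then follows from $r_{uk}$ being independent of $k$ and $r_{uk} = r_{u1} = \hat{r}_u$.
We make use of the following easy-to-verify identity

$$\binom{i}{k} \binom{\tau}{i} = \binom{\tau}{k} \binom{\tau-k}{i-k} ,$$

and use $\theta_k = \frac{\Delta_k}{\alpha_k}$ to describe the ratio of the fraction of information with redundancy equal to
$k$ in the sample to that with redundancy equal to $k$ in the original distribution.
Starting from

$$\Delta_k(k, \boldsymbol{\alpha}, r) = \sum_{i=k}^{\infty} \Delta_k(k, i, r) \, \alpha_i ,$$

we can write $\theta_k = \Delta_k / \alpha_k$ as

$$\theta_k(k, \boldsymbol{\alpha}, r) = \sum_{i=k}^{\infty} \Delta_k(k, i, r) \frac{\alpha_i}{\alpha_k} .$$

Then, for $\boldsymbol{\alpha}_I = (\alpha_k)$,

$$\theta_k = \sum_{i=k}^{\infty} \binom{i}{k} r^k (1-r)^{i-k} \frac{(-1)^{i-1} \binom{\tau}{i}}{(-1)^{k-1} \binom{\tau}{k}}$$
$$= r^k \sum_{i=k}^{\infty} (-1)^{i-k} (1-r)^{i-k} \binom{\tau-k}{i-k}$$
$$= r^k \sum_{i=k}^{\infty} (r-1)^{i-k} \binom{\tau-k}{i-k}$$
$$= r^k \sum_{l=0}^{\infty} \binom{\tau-k}{l} (r-1)^l$$
$$= r^k \left((r-1) + 1\right)^{\tau-k} = r^{\tau} .$$

Further, $\Delta_k = \theta_k \alpha_k = r^{\tau} \alpha_k$ for every $k$ and, hence, summing both sides over all
redundancies of at least $k$, we get $\omega_k = r^{\tau} \eta_k$ and

$$r_{uk}(k, \boldsymbol{\alpha}_I, r) = r^{\tau} , \quad \text{with } 0 < \tau < 1 . \qquad \square$$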
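The invariance claimed in Lemma 1 can also be checked numerically. The sketch below is an illustration with hypothetical function names; it truncates the infinite distribution at a finite `k_max` (an assumption that is harmless here because the binomial weights decay geometrically) and builds $\alpha_k = (-1)^{k-1}\binom{\tau}{k}$ through the recurrence $\alpha_{k+1} = \alpha_k \frac{k - \tau}{k + 1}$ with $\alpha_1 = \tau$, which follows from the definition of the generalized binomial coefficient.

```python
from math import comb

def invariant_alpha(tau, k_max):
    """First k_max entries of alpha_k = (-1)^(k-1) * binom(tau, k)."""
    alpha = [tau]                                  # alpha_1 = tau
    for k in range(1, k_max):
        alpha.append(alpha[-1] * (k - tau) / (k + 1))
    return alpha

def sampled_fraction(k, alpha, r):
    """Delta_k: fraction with redundancy exactly k after sampling a fraction r."""
    return sum(comb(i, k) * r ** k * (1 - r) ** (i - k) * a_i
               for i, a_i in enumerate(alpha, start=1) if i >= k)

if __name__ == "__main__":
    tau, r = 0.5, 0.5
    alpha = invariant_alpha(tau, 400)
    for k in range(1, 6):
        theta_k = sampled_fraction(k, alpha, r) / alpha[k - 1]
        print(k, theta_k, r ** tau)
```

Each ratio $\theta_k = \Delta_k / \alpha_k$ comes out as $r^{\tau}$ independently of $k$, which is the invariance property proven above.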



E.2 Necessary condition for invariants during sampling 

We next argue that the invariant family is the only type of redundancy
distribution that remains invariant under sampling. We proceed in two steps. We first give
a conjectured property that must hold for every distribution that is invariant.
As our argumentation does not meet the requirements of a rigorous proof, we
call this Conjecture 1. We then show with Corollary 1 that, if this conjecture is true,
then the invariant family is indeed the only redundancy distribution invariant
under sampling without replacement.

Conjecture 1 (Necessary condition for invariants). The following is a
necessary condition for an invariant of sampling:

$$r_{uk}(k, \boldsymbol{\alpha}, r) = r^{\tau} \quad \text{with } 0 < \tau < 1 . \qquad (14)$$

Intuitive argument for Conjecture 1. If a function remains invariant during
the evolution, then we know that the $k$-recall $r_{uk}$ is the same for each
redundancy layer $k$. Now while the overall recall $r$ grows, the total amount of sampled
data has to be accommodated by the "space" formed by the growing layers of
redundancy. This space is formed by the dimensionality of the shape of the
distribution. While this shape is filled with more and more data, unique recall has
to grow according to some function that simulates this filling of the space.
Comparing such a shape with a higher dimensional triangle or tetrahedron of higher
dimension (Fig. 15a and Fig. 15b), the functions would be the $n$-th root for
dimension $n$, which translates into a function that grows according to

$$r_{uk}(\boldsymbol{\alpha}_I, r) = r^{1/n} .$$

Since unique recall $r_{u1}$ is concave and always bigger than $r$ except for $r \in \{0, 1\}$,
$n$ must be bigger than 1. Hence, the following condition must hold:

$$r_{uk}(\boldsymbol{\alpha}_I, r) = r^{\tau} , \quad \text{with } 0 < \tau < 1 . \qquad (15)$$

Figure 15c and Fig. 15d show such an example invariant redundancy distribution
with $n = 2$ and the resulting unique recall function with $\hat{r}_u = \sqrt{r}$. The function
could therefore be called a function of dimensionality 2.

Corollary 1 (Invariant family). No other distribution than the distributions
defined in Lemma 1 is invariant under sampling.

Fig. 15: Intuition for Corollary 1: When sampling from a smooth and regular
$n$-dimensional data space, the only function that can describe a realistic
concave sampling success is the $n$-th root, thus giving information $= \sqrt[n]{\text{data}}$.



Proof. Assume that |Eq. 15| holds. We show that the invariant family of |Lemma 1| 
necessarily follows from this conjecture. First, we notice that Eq. 15 also has to 
hold for k — 1 and, hence, we have with 771 = 1, 

ru{ai,r) ^ . 

Calculating the derivatives of r„(r), we get 

ru{r) = 

r::(r)=T(r-l)r--2 

and for the interesting point r = 1, 

rW(r) 
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At the same time, we get from |Prop. 1[ 

m 

fu{r) = 1 - ^ Qffc (1 - r)^ 



k=l 



fc-i 



k=l 



k=2 



\k—n 



(16) 



The term $k^{\underline{n}}$ in the last equation stands for the falling factorial powers (or short,
"falling factorial" or "k to the n falling"), $k^{\underline{n}} = k(k-1)\cdots(k-n+1)$ [13, p. 47].
For the end point r = 1, or actually taking the limit value for $r \to 1$ of Eq. 16,
we get that, in the limit, all terms in the sum disappear except for k = n:

$$ r_u^{(n)}(r)\big|_{r=1} = \lim_{r \to 1} r_u^{(n)}(r) = (-1)^{n-1}\, \alpha_n\, n! $$

and we can express $\alpha_k$, the fraction of information with redundancy k, as a simple
function of the k-th derivative of $r_u$:

$$ \alpha_k = (-1)^{k-1}\, \frac{r_u^{(k)}(1)}{k!} . $$

From that, we can now calculate $\alpha_I$ as

$$ \alpha_k = (-1)^{k-1}\, \frac{\tau^{\underline{k}}}{k!} = (-1)^{k-1}\, \frac{\tau(\tau-1)\cdots(\tau-k+1)}{k!} , $$

where the last equation can be written as

$$ \alpha_k = (-1)^{k-1} \binom{\tau}{k} , $$

since the binomial coefficient is defined for all $\tau \in \mathbb{R}$ [SIT, p. 51].
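
The result of this proof can be checked numerically. The following Python sketch is an illustration only and not part of the original derivation: the function names `invariant_alpha` and `unique_recall` and the truncation length `m` are assumptions made for this example. It builds $\alpha_k = (-1)^{k-1}\binom{\tau}{k}$ from the generalized binomial coefficient and evaluates $r_u(r) = 1 - \sum_k \alpha_k (1-r)^k$, which should reproduce $r^{\tau}$ up to truncation error.

```python
def invariant_alpha(tau, m):
    """Return [alpha_1, ..., alpha_m] with alpha_k = (-1)^(k-1) * binom(tau, k)."""
    alphas = []
    binom = 1.0  # generalized binomial coefficient binom(tau, 0)
    for k in range(1, m + 1):
        binom *= (tau - (k - 1)) / k  # binom(tau, k) from binom(tau, k - 1)
        alphas.append((-1) ** (k - 1) * binom)
    return alphas


def unique_recall(r, tau, m=2000):
    """Evaluate r_u(r) = 1 - sum_k alpha_k (1 - r)^k, truncated after m terms."""
    alphas = invariant_alpha(tau, m)
    return 1.0 - sum(a * (1.0 - r) ** k for k, a in enumerate(alphas, start=1))


if __name__ == "__main__":
    # Compare the series with the closed form r^tau for tau = 1/2 (n = 2).
    for r in (0.1, 0.25, 0.5, 0.9):
        print(r, unique_recall(r, tau=0.5), r ** 0.5)
```

The truncation after `m` terms is harmless for the recalls shown, because the factor $(1-r)^k$ decays geometrically; for r very close to 0 more terms would be needed.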



E.3 Proof Lemma 2 (sampling from complete power laws) 

The statement of the tail remaining invariant during evolution is equivalent to 

$$ \lim_{k \to \infty} \frac{r_{u,k+1}}{r_{uk}} = 1 , $$

and, hence, $r_{uk}$ being independent of k for large k. We prove this in two
steps. First, we prove the limit value in the following Lemma 4; then we use
this limit value to prove Lemma 2.

Lemma 4 (Convergence of $\Theta(k, i, \beta)$).

$$ \lim_{k \to \infty} \underbrace{\Big(\frac{i}{k}\Big)^{-\beta} \frac{(k+1)^{\overline{i-k}}}{(k-\beta+1)^{\overline{i-k}}}}_{\Theta(k,i,\beta)} = 1 , \quad \text{for } i > k,\ \beta > 1 . \qquad (17) $$



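Proof. Transforming to rising factorials $a^{\overline{k}} = \prod_{j \in \{1, \ldots, k\}} (a + j - 1)$, we get

$$ \Theta(k,i,\beta) = \Big(\frac{i}{k}\Big)^{-\beta} \frac{(k+1)^{\overline{i-k}}}{(k-\beta+1)^{\overline{i-k}}} = \frac{i^{-\beta}\, i!}{(1-\beta)^{\overline{i}}} \cdot \frac{(1-\beta)^{\overline{k}}}{k^{-\beta}\, k!} . $$

Note that $\Theta(k,i,\beta) = \Omega(i,\beta)\, \Omega(k,\beta)^{-1}$ with $\Omega(k,\beta) = k^{-\beta}\, k! \,/\, (1-\beta)^{\overline{k}}$. Hence, to show $\lim_{k \to \infty} \Theta(k,i,\beta) = 1$ for $i > k$,
it suffices to show (i) that $\Omega(k, \beta)$ monotonically increases in k above a certain
$k_0$, and (ii) that $\Omega(k, \beta)$ converges for $k \to \infty$.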
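Monotonicity follows from direct calculation:

$$
\begin{aligned}
\Omega(k+1, \beta) &> \Omega(k, \beta) \\
\frac{(k+1)^{-\beta}\,(k+1)!}{(1-\beta)^{\overline{k}}\,(k+1-\beta)} &> \frac{k^{-\beta}\, k!}{(1-\beta)^{\overline{k}}} \\
\frac{k+1}{k+1-\beta} &> \Big(\frac{k+1}{k}\Big)^{\beta} && \big|\; j := k+1 > 2 \\
\frac{j}{j-\beta} &> \Big(\frac{j}{j-1}\Big)^{\beta} \\
(j-1)^{\beta} &> (j-\beta)\, j^{\beta-1} \\
j^{\beta} - \binom{\beta}{1} j^{\beta-1} + \binom{\beta}{2} j^{\beta-2} - \ldots &> j^{\beta} - \beta\, j^{\beta-1} && \big|\; \text{binomial theorem} \\
\binom{\beta}{2} j^{\beta-2} - \ldots &> 0 ,
\end{aligned}
$$

which is true for $j > \beta - 1$, and, hence, $k > k_0 = \max[\beta, 2]$.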
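Convergence for $\beta \notin \mathbb{N}$ follows from

$$ \lim_{n \to \infty} \Omega(n, \beta) = \Gamma(1 - \beta) , $$

which follows from Euler's formula for the Gamma function [U Eq. 6.1.2]

$$ \Gamma(z) = \lim_{n \to \infty} \frac{n!\, n^{z}}{(z)^{\overline{n+1}}} $$

or

$$ \Gamma(1 - \beta) = \lim_{n \to \infty} \frac{n!\, n^{1-\beta}}{(1-\beta)^{\overline{n+1}}} , $$

so that

$$ \lim_{n \to \infty} \Omega(n, \beta) = \lim_{n \to \infty} \frac{n^{-\beta}\, n!}{(1-\beta)^{\overline{n}}} = \lim_{n \to \infty} \frac{n!\, n^{1-\beta}}{(1-\beta)^{\overline{n+1}}} \cdot \frac{n+1-\beta}{n} = \Gamma(1-\beta) \lim_{n \to \infty} \frac{n+1-\beta}{n} = \Gamma(1-\beta) . $$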



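In the above derivation, we had to state $\beta \notin \mathbb{N}$ due to the otherwise undefined
value of $\Gamma(1 - \beta)$. To include all $\beta > 1$, we can note

$$ \Theta(k,i,\beta) = \frac{i^{-\beta}\, i!}{(z-\beta)^{\overline{i-z+1}}} \cdot \frac{(z-\beta)^{\overline{k-z+1}}}{k^{-\beta}\, k!} , \qquad (18) $$

where $z \in \mathbb{N}$ is chosen so $i > k > z > \beta > 1$.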
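
The convergence argument of Lemma 4 can be illustrated numerically. The sketch below is an illustration under stated assumptions, not the authors' code: the helper names `omega` and `theta` are hypothetical, and it takes $\Omega(n,\beta) = n^{-\beta}\, n! / \prod_{j=1}^{n}(j-\beta)$ (the rising factorial $(1-\beta)^{\overline{n}}$ written as a product) and $\Theta(k,i,\beta) = \Omega(i,\beta)/\Omega(k,\beta)$. For a non-integer $\beta$, $\Omega(n,\beta)$ should approach $\Gamma(1-\beta)$, and $\Theta$ should approach 1.

```python
import math


def omega(n, beta):
    """Omega(n, beta) = n^(-beta) * n! / prod_{j=1..n} (j - beta), for non-integer beta."""
    value = 1.0
    for j in range(1, n + 1):
        # j-th factor of n! divided by the j-th factor of the rising factorial
        value *= j / (j - beta)
    return value * n ** (-beta)


def theta(k, i, beta):
    """Theta(k, i, beta) written as the ratio Omega(i, beta) / Omega(k, beta)."""
    return omega(i, beta) / omega(k, beta)


if __name__ == "__main__":
    beta = 1.5
    for n in (10, 100, 1000, 10000):
        print(n, omega(n, beta), math.gamma(1.0 - beta))
    print(theta(1000, 2000, beta))
```

Accumulating the ratio `j / (j - beta)` factor by factor avoids forming the very large factorial and rising factorial separately.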



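Proof (Lemma 2). From Fig. 4 we know that the three power laws have the
same tail distributions. Without loss of generality, we consider here power laws
in the redundancy frequency plot $\alpha_k$ with

$$ \alpha_k = \frac{k^{-\beta}}{\zeta(\beta)} . $$

Starting from

$$ A_k(k, \alpha, r) = \sum_{i=k}^{\infty} A_k(k, i, r)\, \alpha_i , $$

we can write $\omega_k = A_k / \alpha_k$ as

$$ \omega_k(k, \alpha, r) = \sum_{i=k}^{\infty} A_k(k, i, r)\, \frac{\alpha_i}{\alpha_k} , $$

and, for $\alpha_C = (\alpha_k)$,

$$ \omega_k(k, \alpha_C, r) = \sum_{i=k}^{\infty} \binom{i}{k} r^{k} (1-r)^{i-k} \Big(\frac{i}{k}\Big)^{-\beta} . $$

From Eq. 17 we know

$$
\begin{aligned}
\lim_{k \to \infty} \binom{i}{k} \Big(\frac{i}{k}\Big)^{-\beta}
&= \binom{i}{k} \frac{(i-\beta)\cdots(k-\beta+1)}{i^{\underline{i-k}}} \\
&= \binom{i}{k} \frac{(\beta-k-1)\cdots(\beta-i)}{i^{\underline{i-k}}}\, (-1)^{i-k} \\
&= \frac{(\beta-k-1)\cdots(\beta-i)}{(i-k)!}\, (-1)^{i-k} \\
&= (-1)^{i-k} \binom{\beta-k-1}{i-k} .
\end{aligned}
$$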



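We can now write

$$
\begin{aligned}
\lim_{k \to \infty} \omega_k(k, \alpha_C, r)
&= \sum_{i=k}^{\infty} r^{k} (1-r)^{i-k} (-1)^{i-k} \binom{\beta-k-1}{i-k} \\
&= r^{k} \sum_{x=0}^{\infty} \binom{\beta-k-1}{x} (r-1)^{x} \\
&= r^{k} \big((r-1) + 1\big)^{\beta-k-1} = r^{\beta-1} .
\end{aligned}
$$

Then, since every $\omega_x$ with $x > k$ approaches the same value, the $\alpha$-weighted average over the tail has the same limit,

$$ \lim_{k \to \infty} \frac{\sum_{x=k+1}^{\infty} \omega_x\, \alpha_x}{\sum_{x=k+1}^{\infty} \alpha_x} = r^{\beta-1} , $$

and, hence,

$$ \lim_{k \to \infty} r_{uk}(k, \alpha_C, r) = r^{\beta-1} , $$

which is independent of k. □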
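
The limit used in this proof can be checked numerically. The Python sketch below is an illustration only: the function name `omega_k` and the cutoff heuristic `tail_factor` are assumptions made for this example. It evaluates $\omega_k = \sum_{i \ge k} \binom{i}{k} r^{k} (1-r)^{i-k} (i/k)^{-\beta}$ in log space and compares it with $r^{\beta-1}$ for growing k.

```python
import math


def omega_k(k, r, beta, tail_factor=10):
    """Sum_{i>=k} C(i,k) r^k (1-r)^(i-k) (i/k)^(-beta), truncated numerically."""
    # Assumed cutoff: the binomial weights concentrate around i = k / r,
    # so terms far beyond that point contribute negligibly.
    i_max = int(tail_factor * k / r) + 100
    total = 0.0
    for i in range(k, i_max + 1):
        log_term = (
            math.lgamma(i + 1) - math.lgamma(k + 1) - math.lgamma(i - k + 1)  # log C(i, k)
            + k * math.log(r)
            + (i - k) * math.log(1.0 - r)
            - beta * math.log(i / k)
        )
        total += math.exp(log_term)
    return total


if __name__ == "__main__":
    r, beta = 0.5, 2.5
    for k in (10, 50, 200, 800):
        print(k, omega_k(k, r, beta), r ** (beta - 1.0))
```

Working with logarithms keeps the binomial coefficients from overflowing for large i and k.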
Figure 16 illustrates Lemma 2 for three power law coefficients.

Fig. 16: Sampling from any completely developed power law leads to other
power laws since k-recall $r_{uk}$ is independent of k above a certain threshold.

F Proof Theorem 1 (sampling from truncated power laws)

Proof. Theorem 1 follows readily from Lemma 2. Assume $\alpha_T$ is a truncated
power law with maximum redundancy $k_{\max}$. (1) For $k > k_{\max}$, $A_k = 0$, since $\alpha_i = 0$ for all $i > k_{\max}$ by definition.
(2) For $k \le k_{\max}$ we know

$$
\begin{aligned}
A_k(k, \alpha_T, r) &= \sum_{i=k}^{k_{\max}} A_k(k, i, r)\, \alpha_i \\
&= \sum_{i=k}^{\infty} A_k(k, i, r)\, \alpha_i - \sum_{i=k_{\max}+1}^{\infty} A_k(k, i, r)\, \alpha_i .
\end{aligned}
$$

(2a) For $k \ll k_{\max}$, the second summand is small compared to the first
one and we get

$$ \lim A_k(k, \alpha_T, r) = A_k(k, \alpha_C, r) , $$

from which it follows that, at the lower side, the sample distribution from a truncated
power law behaves the same as that from a completely developed power law.

(2b) For $k \to k_{\max}$, the second term becomes increasingly dominant and $A_k$
and, hence, $r_{uk}$ become smaller. If $k_{\max}$ is sufficiently large (large data
sets), the observed distribution must therefore follow a power law.

□
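
Cases (2a) and (2b) can be made concrete with a small numerical experiment. The Python sketch below is an illustration under stated assumptions, not the authors' code: the names `sampled_mass` and `truncation_ratio` are hypothetical, $A_k(k,i,r)$ is taken to be the binomial term $\binom{i}{k} r^{k}(1-r)^{i-k}$, the power law is left unnormalized (the constant cancels in the ratio), and the "complete" power law is approximated by a large finite cutoff.

```python
import math


def sampled_mass(k, r, beta, i_max):
    """Unnormalized A_k = sum_{i=k}^{i_max} C(i,k) r^k (1-r)^(i-k) i^(-beta)."""
    total = 0.0
    for i in range(k, i_max + 1):
        log_term = (
            math.lgamma(i + 1) - math.lgamma(k + 1) - math.lgamma(i - k + 1)  # log C(i, k)
            + k * math.log(r)
            + (i - k) * math.log(1.0 - r)
            - beta * math.log(i)
        )
        total += math.exp(log_term)
    return total


def truncation_ratio(k, r, beta, k_max, cutoff_factor=20):
    """Ratio of A_k for the truncated power law to A_k for the (approximated) complete one."""
    # Assumed cutoff standing in for the infinite sum of the complete power law.
    cutoff = int(cutoff_factor * k_max / r)
    return sampled_mass(k, r, beta, k_max) / sampled_mass(k, r, beta, cutoff)


if __name__ == "__main__":
    r, beta, k_max = 0.5, 2.5, 1000
    for k in (5, 100, 400, 500, 600, 900):
        print(k, truncation_ratio(k, r, beta, k_max))
```

For k far below $k_{\max}$ the ratio stays close to 1 (case 2a), while it collapses as k approaches $k_{\max}$ (case 2b), which is the "breaking in" of the tail.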



Figure 17 together with Figure 5c and Fig. 5d illustrate Theorem 1.

Fig. 17: Theorem 1. Sampling from truncated power law distributions leads
to power law distributions with the tail "breaking in" for increasingly lower
recalls. However, the core of the power law still shows $r_{uk} \approx r^{\beta-1}$, and the
sample distribution thus is a power law. The larger the data set, the better
the approximation.



